Direct2D: How to implement deferred registration of custom effects

One weird trick …

If you’re writing a custom effect for Direct2D, which involves implementing the ID2D1EffectImpl interface, you may run into a chicken-and-egg problem if you’re trying to use another custom effect that isn’t yet registered with the Direct2D factory.

Direct2D is set up very carefully to prevent the custom effect from getting access to things like the ID2D1DeviceContext. Unfortunately, you also don’t get access to the ID2D1Factory1, and this creates a brick wall that blocks you from registering effects just-in-time (versus ahead-of-time).

When Direct2D calls your ID2D1EffectImpl::Initialize() method, it’ll provide you with an ID2D1EffectContext and an ID2D1TransformGraph. The ID2D1EffectContext::CreateEffect() method allows you to instantiate another effect, which you can then use in a call to ID2D1EffectContext::CreateTransformNodeFromEffect() to create an ID2D1TransformNode that you can place into your transform graph.

However, there are no registration methods on ID2D1EffectContext like you see on ID2D1Factory1. This means that you have to register your effects up-front, probably in the custom effect’s static Register method (which is the pattern used in the Direct2D documentation).

Notably, this requires that your custom effect has what amounts to a dependency manifest in its Register method whereby it calls all of the Register methods for the custom effects that it needs to utilize later. This can be seen as a violation of the DRY (“Don’t Repeat Yourself") principle, and can be a source of subtle bugs and frustrating debugging if it gets out of date.

However, you can still implement a proper deferred registration mechanism. Here’s how I do it in Paint.NET, although I won’t show the actual code because it’s in C# with lots of my own custom things that won’t make sense out-of-context:

Step 1: Implement your custom effect all the way up through ID2D1EffectImpl::Initialize(). Note that this method receives an ID2D1EffectContext and an ID2D1TransformGraph.

At this point you’ll be wanting to use ID2D1EffectContext::CreateEffect(), followed by ID2D1EffectContext::CreateTransformNodeFromEffect(), and then an appropriate method on the ID2D1TransformGraph to add the effect’s transform node into your transform graph.

Step 2: If ID2D1EffectContext::CreateEffect() returns an error, then either the effect can’t be created or it hasn’t yet been registered. If it didn’t return an error, then go about your business normally. Otherwise, proceed to step 3.

NOTE: Direct2D is supposed to return D2DERR_EFFECT_IS_NOT_REGISTERED (0x88990028), but it doesn’t. It’ll actually return HRESULT_FROM_WIN32(ERROR_NOT_FOUND), which is 0x80070490. I check for both, just in case Microsoft decides to fix its code.

Step 3: At this point you’re normally SOL, as we need to register the other custom effect we want to use, but we don’t have an ID2D1Factory1 to call a RegisterEffect method on.

So, step 3 is to actually create an effect that does exist using ID2D1EffectContext::CreateEffect(). I use CLSID_D2D1Flood, as it’s a very simple effect and probably cheap. We won’t be using the effect for its intended purpose – it’s a tool to get what we really want.

Step 4: Now that you have an ID2D1Effect, we’re still not quite where we want to be because it doesn’t implement ID2D1Resource which has the GetFactory() method that we need to get the ID2D1Factory that has the RegisterEffect methods we need. If only there were a way …

But we’re not stuck yet! Either call ID2D1Effect::GetOutput() to get the effect’s ID2D1Image, or use QueryInterface to retrieve a pointer to the ID2D1Image interface. The documentation for GetOutput establishes these as equivalent, btw.

Step 5: Now you can call ID2D1Resource::GetFactory() on the effect’s ID2D1Image pointer, and then do the obvious thing … (e.g. QI for ID2D1Factory1, then use that for registering the other effect). I like to call ID2D1Factory1::GetEffectProperties() first to try and distinguish between the effect not being registered versus the effect’s initialization returning an error. If the effect is already registered then just bail and return a failure HRESULT instead of trying to forge ahead with re-registering the effect.

Step 6: Don’t forget to release all of the temporary objects.

An important thing here is that effect registration works correctly when called from within the initialization of another effect. Hopefully it stays that way!

Anyway, this was one weird trick I found that enabled a lot of my code to be simpler. It’s still better if you can find a different way to provide access to the ID2D1Factory1 for your custom effect, but this is a good fallback technique.

In an upcoming update, Paint.NET will be opening the doors for effect plugins to use Direct2D, to implement custom Direct2D effects, to chain these effects together, and to not have to worry about up-front registration of custom effect dependencies. Stay tuned!

Win32: How to get the refresh rate for a window

I mentioned in a previous blog post that I had finally written the code for Paint.NET so that it would run the animation timer at the correct rate. Namely, at the refresh that the monitor is actually running at instead of a constant 120 Hz.

For the 4.0 release, I chose 120 Hz as a compromise to not delay the release, partly because I’d been working on 4.0 for 5 years and was exhausted – any further delay was just not okay. I had already spent a bunch of time researching how to do this, but had not yet been successful in putting all the pieces together, and couldn’t justify spending more time on it.

The Code

So without further ado, here’s some code that shows exactly how to do this: https://github.com/rickbrew/RefreshRateWpf/

The 4 Steps

Here are the 4 steps that it takes to get the refresh rate:

Step 1: Get an HWND

An HWND is a “handle to a window.” This is pretty easy. Just do it.

In raw Win32, you should already have this. MFC, ATL, and WTL are super easy too.

In WinForms, just grab the value of the Form’s Handle property.

In WPF, you’ll want WindowInteropHelper’s help.

In UWP, you’ll want the help of ICoreWindowInterop.

Step 2: Get the HMONITOR

Thankfully this is pretty easy too, thanks to the MonitorFromWindow function. If your window is straddling multiple monitors then this will return the “best” one, at Windows’ discretion. Enumerating all of those monitors is probably much more complex, but I’d be surprised if it’s not possible somehow.

Step 3: Get the MONITORINFOEX

In other words, get information about the monitor. Again, the aptly named GetMonitorInfo function helps us out here.

Step 4: Get the monitor’s display settings

This includes the resolution and refresh rate and … all sorts of weird information that probably isn’t relevant in this decade. We using EnumDisplaySettings for this, which isn’t as obvious if you’re not well-versed in Win32 API patterns.

Also, note that I’m detailing how to query for this information. I didn’t look into how to get a notification for when this changes. I ultimately decided to key off of the standard window activation/focus events, because in order to change the refresh rate you have to click over to a control panel and back. I’m pretty sure WM_DISPLAYCHANGE would be more precise, but I didn’t verify this (it doesn’t say anything about refresh rate changes).

The Backstory

That didn’t seem too bad … so why did I say that this so difficult? The code is only a single page long in C#, and most of it’s just interop definitions!

Well, this is Win32 we’re talking about. Everything’s cryptic, and a lot of the real documentation is buried in the tribal knowledge of various Microsoft engineers. You may only end up with a few paragraphs of code, but it’s often a lot of work to get there (and this isn’t unique to Win32: I just described a lot of software development!).

DirectX, or rather DXGI, turned out to be more readable but way more complex, as we’ll see soon. I didn’t end up needing DirectX’s help in the end, as the code can testify to.

DXGI rabbit hole

My first research attempts were in DXGI to try and find this information. The DXGI_MODE_DESC has the refresh rate right in plain sight, and it’s a DXGI_RATIONAL so maybe it’ll be more accurate than an integer for those times when a display is running at something weird like 59.97 Hz.

Unfortunately, I was never able to find out how to query the current display mode for a specific monitor, window, or render target.

I got pretty close though:

1. Starting with ID2D1RenderTarget, call QueryInterface() to retrieve an interface pointer for ID2D1DeviceContext.

2. On ID2D1DeviceContext, call GetDevice() to get the ID2D1Device.

3. On the ID2D1Device, call QueryInterface() to retrieve an interface pointer for ID2D1Device2.

4. On the ID2D1Device2, call GetDxgiDevice() to get the IDXGIDevice.

5. On the IDXGIDevice, call GetAdapter() to get the IDXGIAdapter

From here you will need to use EnumOutputs to enumerate the IDXGIOutputs, then call GetDesc() on each one until you find one with the right HMONITOR for your HWND. Which means you can’t actually start from your render target (or device context) and get to the refresh rate. You still need some external information, namely the HWND, to get there.

IDXGIOutput also has FindClosestMatchingMode, which sounds promising, but it’s no help at all for what we need.

IDXGIOutput::GetDisplayModeList allows you to enumerate all the modes, but not the current mode. This just seems like an omission to me. IDXGIOutput1 through 5, retrievable via QueryInterface(), don’t have anything to help here either.

Not being able to go from the render target to the refresh rate actually makes sense, since a render target doesn’t have to be attached to a monitor (it could be pointed at a bitmap). However, not being able to query the monitor’s current mode is not something I’ve come up with a plausible explanation for.

Maybe I’m wrong and there is a way to do this with DXGI – like maybe the first entry in GetDisplayModeList is the current mode by convention. The documentation says nothing about this, however.

So, DXGI turned out to be an empty rabbit hole that left me frustrated and so I shelved the problem for a later date.

Conclusion

But, I did finally get it working! As is often the case, it mostly required deciding that this really was the most important thing to work on at the time (prioritization, in other words). Then I sat down for a few hours, did the research, wrote and experimented with some code in C so I wouldn’t have to worry about interop definitions, then ported it to a little C#/WPF sample app, debugged the interop bugs, and then it was ready for integration into Paint.NET (another hour or two).

And now, finally, as of version 4.0.17, Paint.NET is using a lot less CPU time any time it does any animations. This made opening many images run a lot faster (you know you can multiselect with File->Open, right?) – the image thumbnail lists does all sorts of neat animations when you’re doing that.

ConcurrentDictionary allocates … a lot

I was inspecting the latest build of Paint.NET with SciTech Memory Profiler  and noticed that there were a lot of System.Object allocations. Thousands of them … then, tens of thousands of them … and when I had opened 100 images, each of which were 3440×1440 pixels, I had over 800,000 System.Objects on the heap. That’s ridiculous! Not only do those use up a ton of memory, but they can really slow down the garbage collector. (Yes, they’ll survive to gen2 and live a nice quiet retired life, for the most part … but they also have to first survive a gen0 and then a gen1 collection.)

Obviously my question was, where are these coming from?! After poking around in the object graph for a bit, and then digging in with Reflector, it eventually became clear: every ConcurrentDictionary was allocating an Object[] array of size 128, and immediately populating it with brand new Object()s (it was not lazily populated). And Paint.NET uses a lot of ConcurrentDictionarys!

Each of these Objects serves as a locking object to help ensure correctness and good performance for when there are lots of writes to the dictionary. The reason it allocates 128 of these is based on its default policy for concurrencyLevel: 4 x ProcessorCount. My system is a 16-core Dual Xeon E5-2687W with HyperThreading, which means ProcessorCount = 32.

There’s no way this level of concurrency is needed, so I quickly refactored my code to use a utility method for creating ConcurrentDictionary instead of the constructor. Most places in the code only need a low concurrency level, like 2-4. Some places did warrant a higher concurrency level, but I still maxed it out at 1x ProcessorCount.

Once this was done, I recreated the slightly contrived experiment of loading up 100 x 3440×1440 images, and the System.Object count was down to about ~20,000. Not bad!

This may seem like a niche scenario. “Bah! Who buys a Dual Xeon? Most people just have a dual or quad core CPU!” That’s true today. But this will become more important as Intel’s Skylake-X and AMD’s Threadripper bring 16-core CPUs much closer to the mainstream. AMD is already doing a fantastic job with their mainstream 8-core Ryzen chips (which run Paint.NET really fantastically well, by the way!), and Intel has the 6-core Coffee Lake headed to mainstream systems later this year. Core counts are going up, which means ConcurrentDictionary’s memory usage is also going up.

So, if you’re writing a Windows app with the stock .NET Framework and you’re using ConcurrentDictionary a lot, I’d advise you to be careful with it. It’s not as lightweight as you think.

(The good news is that Stephen Toub updated this in the open source .NET Core 2.0 so that only 1x ProcessorCount is employed. Here’s the commit. This doesn’t seem to have made it into the latest .NET Framework 4.7, unfortunately.)

.NET Strings are immutable … except that they’re not

(Side note: expect a Paint.NET 4.0.6 update soon, hopefully timed with the availability of Windows 10. It’s not just a bugfix release, either. More info soon!)

If you have some code whose performance is limited by string building or concatenation, you may want to read this. For everyone else: close your eyes! Here be dragons! (just kidding, but seriously, be careful with this stuff)

Everybody knows that strings are immutable in .NET. When I first tried out C# (in 2003?) this annoyed me, as I was used to C and C++ and being able to do whatever I wanted. I quickly learned to love this property, and embraced immutability and functional programming style in general. The benefits far outweigh the initial clumsiness.

Everybody also knows that you can pin a string with unsafe and fixed, get a pointer to the first character, and read directly from it or even hand it off the native code. No copying required. Right? Right?!

And following from that, there’s a dirty little secret here that nobody really likes to talk about: .NET strings aren’t actually immutable. You can get a pointer to one, and you can write to that pointer because it’s allocated in the regular heap and not in some special read-only memory pages (e.g. via VirtualProtect). Strings in .NET don’t cache their hash code, therefore there’s nothing to cause an inconsistency if you do something like …

String s = “I’m immutable, right?”;
unsafe
{
    fixed (char* p = s)
    {
        *(p + 4) = ‘_‘;
        *(p + 5) = ‘_‘;        
    }
}

Console.WriteLine(s); // outputs “I’m __mutable, right?”;

Note that if you do this anywhere other than while initializing a string, be prepared for the local townsfolk to come knocking on your door. With torches and pitchforks. It’s very important to respect the immutability contract once you hand off a String to external code.

If we browse around with Reflector or over on github in the coreclr repo, we can see that .NET itself takes advantage of this in functions such as System.String.Concat()  and its helper function FillStringChecked():

public static String Concat(String str0, String str1, String str2) {
    // … parameter checking and other stuff omitted for brevity …
    int totalLength = str0.Length + str1.Length + str2.Length;
    String result = FastAllocateString(totalLength);
    FillStringChecked(result, 0, str0);
    FillStringChecked(result, str0.Length, str1);
    FillStringChecked(result, str0.Length + str1.Length, str2);
    return result;
}

unsafe private static void FillStringChecked(String dest, int destPos, String src)
{
    // … parameter checking omitted for brevity
    fixed(char *pDest = &dest.m_firstChar)
    fixed (char *pSrc = &src.m_firstChar) {
        wstrcpy(pDest + destPos, pSrc, src.Length);
    }
}

And wstrcpy just calls into some regular ol’ memcpy/memmove type code. Some of those code paths (via wstrcpy) are managed code, others eventually thunk into native, but none of it’s magic: you can do all of this yourself! You can’t call FastAllocateString()*, but you can still allocate a string of a specified length via new String(char c, int count). Just use a placeholder character for the first parameter, like a space or even null ‘’. Also note that all .NET strings are null-terminated, so if you create one of length N, you actually have N+1 characters. The memory layout intentionally matches BSTR which gives a lot of flexibility for passing things to native code without having to do any copying. (I don’t know what would happen if you overwrote the length prefix, but feel free to try it out and post a comment!)

With this information you can build your own versions of String.Concat or StringBuilder that can avoid making copies in specialized situations. However, I’m not actually sure how useful this is: I looked through the Paint.NET codebase and found only a handful of places that do lots of string manipulation, and even then they were mostly in error reporting code paths whose performance was entirely unimportant. I decided to leave the code alone, preferring safety over a little bit of performance (and fun and creativity). Your Mileage May Vary.

On a side note, the fact that you can do this at all is one of the reasons why I love .NET and C#. You get all the good stuff that a managed runtime and high-level language is supposed to provide, but you also get “escape hatches” to drill through the glass floors on the abstraction ladder. You can push the runtime and its safety guarantees out of the way in the name of performance (or for other reasons). Paint.NET takes advantage of this a lot: graphics code loves things like pointers and structs. Lately I’ve been doing a lot of work on Android** with Java, and the lack of things like pointers and structs … gosh, it just makes things so much more difficult to work with from a performance standpoint!

* Well, maybe you could with reflection Winking smile That would probably eliminate its “fast” property though.

** Sorry, not for Paint.NET.

What’s coming in 4.0.4

Whew, it’s been awhile since there was an update to paint.net 4.0. It’s finally time to start talking about the next one!

I’ve been slowly chipping away at various little things in Paint.NET, especially with respect to performance. And also some missing features and bug fixes.

Here’s a preview of what’s coming in 4.0.4:

  • The ability to choose a “Fill” is coming back for the Paintbrush tool. I didn’t think many people would miss it and so I left it out of 4.0 in order to save time for other things. I was wrong on two fronts here: 1) people do need it, and 2) it only took a few minutes to implement. So, in 4.0.4, it’ll be back.
  • Dramatically faster Magic Wand tool. I came up with a new algorithm to do the reverse scan-conversion that is needed here. On small images you probably won’t notice, but if you’ve ever had the Magic Wand tool “take forever” then the next update is going to be your favorite one of all time. I’ve seen the Magic Wand go from taking 60+ seconds in 4.0.3 to taking maybe 2 or 3 seconds in 4.0.4.
  • The improvements to the Magic Wand tool’s reverse scan-conversion algorithm have also been applied to make aliased selections work a lot faster. (Actually I’ll tell the proper truth here: This work was originally intended as a challenge to myself to see if I could make the aliased selection rendering run a lot faster. It just so happened to be more relevant for the Magic Wand Smile)
  • Much faster performance overall. I’ve chipped away at performance in the areas of antialiased selections, selection clipping, and so on. Anything involving selections should be a lot faster. In particular, the Move Selected Pixels tool has a fantastically smoother framerate. Some of this optimization just involved a better choice of sorting algorithm (e.g. by using introspective sort instead of raw quicksort). I have also aggressively applied a coding pattern in C# which allows elision of virtual function calls (via inlining) by combining generics, structs, and interfaces. If anyone is interested in hearing the details, please let me know in the comments below, as I’m not sure this has been discussed much in the broader C#/.NET community. This pattern is responsible for the first 20% performance gain in my C# implementations of quicksort and introspective sort, so it’s nothing to sneeze at Smile Other performance improvements came about by being smarter about how selection mask generation was parallelized with other computations, by replacing integer divisions with multiply/add/shift combinations, and so on. Windows Performance Analyzer is your friend, folks! I’ve been using it extensively to optimize paint.net 4.0 (both now and in the past).
  • Other miscellaneous bug fixes, of course. The Gear shape had some glitches (oops), the Text tool produced foul output if you used its Smooth rendering mode while antialiasing was disabled, and Edit –> Clear wasn’t behaving the same as it was in v3.5 and was causing problems because of it.

“Reverse scan-conversion” is the inverse of polygon scan conversion (also known as rasterization). It’s the process of taking a list of non-overlapping rectangles with integer coordinates and extents (e.g. System.Windows.Int32Rect) and stitching them together to form a polygon (a list of points in 2D space). In Paint.NET this is used when you have chosen to use aliased selections: your selection’s polygonal outline is scan-converted (rasterized) to produce a list of integer rectangles which approximate its interior (this is a well-known algorithm in computer graphics). This list of rectangles (or “scans”) is then fed into the next algorithm which stitches the rectangles together to create a “pixelated” (or rectilinear) polygon outline. In 4.0.3 and before, each rectangle was first transformed into a closed polygon and then they were all run through GPC’s union algorithm. This ensured that any touching edges were coalesced. Unfortunately, GPC just does not like this scenario and performs excruciatingly slow, and sometimes even crashes via stack overflow.

As it turns out, this has a lot of overlap with what the Magic Wand tool does. It too needs to take a list of rectangles and transform it into a polygon, although the source is different. We start by applying ye ol’ flood fill algorithm, with tolerance thresholding, to the image in order to create a bitmask that tells us which pixels are included or not. This bitmask is then converted into a list of rectangles via a process akin to RLE compression. And then that list of rectangles must be converted into a polygon because that’s the way Paint.NET’s selection system works (it is geometry based: it needs polygons!). This is what takes up about 99% of the CPU time when you see the Magic Wand “taking forever,” and by reducing it from O(n^m) down to O(n), much time and frustration is saved. (I’m doing a little hand-waving here by not specifying what ‘n’ is. Also, I haven’t actually analyzed GPC’s code to figure out what m is, although it’s probably 2. Have you tried analyzing its code? It’s intense Smile)

Almost all of the CPU time is now spent drawing the selection outline with Direct2D. A future update might optimize that too since Direct2D is using a general purpose algorithm, something which is overkill when all I need is to apply a 1 pixel stroke to a polygon.

More on Multicore JIT

Last month I posted about how you can use Multicore JIT, a .NET 4.5 feature, even if your app is compiled for .NET 4.0. It’s a great feature which can help your app’s startup performance with very little code change: it’s essentially free. In some cases (e.g. ASP.NET) it’s enabled automatically.

But wait, there’s more!

Over on the .NET Framework Blog, Dan Taylor has posted a really good write-up about Multicore JIT with graphs showing the performance improvements when it’s applied to Paint.NET 4.0, Windows Performance Analyzer, and Windows Assessment Console. I was involved in the code changes for each of these (obviously for Paint.NET), which leads into the next link …

We also did a video interview with Vance Morrison (Performance Architect on .NET) which is now posted over at Channel 9: http://channel9.msdn.com/posts/net-45-multicore-jit

Using Multi-Core JIT from .NET 4.0 if .NET 4.5 is installed

.NET Framework 4.5 contains a very cool new feature called Multi-Core JIT. You can think of it as a profile-guided JIT prefetcher for application startup, and can read about it in a few places …

I’ve been using .NET 4.0 to develop Paint.NET 4.0 for the past few years. Now that .NET 4.5 is out, I’ve been upgrading Paint.NET to require it. However, due to a circumstance beyond my control at this moment, I can’t actually use anything in .NET 4.5 (see below for why). So Paint.NET is compiled for .NET 4.0 and can’t use .NET 4.5’s features at compile time, but as it turns out they are still there at runtime.

I decided to see if it was possible to use the ProfileOptimization class via reflection even if I compiled for .NET 4.0. The answer: yes! You may ask why you’d want to do this at all instead of biting the bullet and requiring .NET 4.5. Well, you may need to keep your project on .NET 4.0 in order to maintain maximum compatibility with your customers who aren’t yet ready (or willing Smile) to install .NET 4.5. Maybe you’d like to use the ProfileOptimization class in your next “dot release” (e.g. v1.0.1) as a free performance boost for those who’ve upgraded to .NET 4.5, but without displacing those who haven’t.

So, here’s the code, which I’ve verified as working just fine if you compile for .NET 4.0 but run with .NET 4.5 installed:

using System.Reflection;

Type systemRuntimeProfileOptimizationType = Type.GetType("System.Runtime.ProfileOptimization", false);
if (systemRuntimeProfileOptimizationType != null)
{
    MethodInfo setProfileRootMethod = systemRuntimeProfileOptimizationType.GetMethod("SetProfileRoot", BindingFlags.Static | BindingFlags.Public, null, new Type[] { typeof(string) }, null);
    MethodInfo startProfileMethod = systemRuntimeProfileOptimizationType.GetMethod("StartProfile", BindingFlags.Static | BindingFlags.Public, null, new Type[] { typeof(string) }, null);

    if (setProfileRootMethod != null && startProfileMethod != null)
    {
        try
        {
            // Figure out where to put the profile (go ahead and customize this for your application)
            // This code will end up using something like, C:\Users\UserName\AppData\Local\YourAppName\StartupProfile\
            string localSettingsDir = Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData);
            string localAppSettingsDir = Path.Combine(localSettingsDir, "YourAppName");
            string profileDir = Path.Combine(localAppSettingsDir, "ProfileOptimization");
            Directory.CreateDirectory(profileDir);

            setProfileRootMethod.Invoke(null, new object[] { profileDir });
            startProfileMethod.Invoke(null, new object[] { "Startup.profile" }); // don’t need to be too clever here
        }

        catch (Exception)
        {
            // discard errors. good faith effort only.
        }
    }
}

I’m not sure I’ll be using this in Paint.NET 4.0 since it uses NGEN already, but it’s nice to have this code snippet around.

So, why can’t I use .NET 4.5? Well, they removed support for Setup projects (*.vdproj) in Visual Studio 2012, and I don’t yet have the time or energy to convert Paint.NET’s MSI to be built using WiX. I’m not willing to push back Paint.NET 4.0 any further because of this. Instead, I will continue using Visual Studio 2010 and compiling for .NET 4.0 (or maybe I’ll find a better approach). However, at install time and application startup, it will check for and require .NET 4.5. The installer will get it installed if necessary. Also, there’s a serialization bug in .NET 4.0 which has dire consequences for images saved in the native .PDN file format, but it’s fixed in .NET 4.5 (and for .NET 4.0 apps if 4.5 just happens to be what’s installed).

Marshaling native arrays back as managed arrays without copying

I’ve come up with a trick that can be used in some very specific scenarios in order to avoid extra array copying when calling into native code from managed code (e.g. C#). This won’t usually work for regular P/Invokes into all your favorite Win32 APIs, but I’m hopeful it’ll be useful for someone somewhere. It’s not even evil! No hacks required.

Many native methods require the caller to allocate the array and specify its length, and then the callee fills it in or returns an error code indicating that the buffer is too small. The technique described in this post is not necessary for those, as they can already be used optimally without any copying.

Instead, let’s talk about the general problem if you’re calling a native method which does the array allocation and then returns it. You can’t use it as a “managed array” unless you copy it into a brand new managed array (don’t forget to free the native array). In other words, native { T* pArray; size_t length; } cannot be used as a simple managed T[] as-is (or even with modification!). The managed runtime didn’t allocate it, won’t recognize it, and there’s nothing you can do about it. Very few managed methods will accept a pointer and a length, and will require a managed array. This is particularly irksome when you want to use System.IO.Stream.Read() or Write() with bytes from a native-side buffer.

Paint.NET uses a library written in classic C called General Polygon Clipper (GPC), from The University of Manchester, to perform polygon clipping. This is used for, among other things, when you draw a selection with a mode such as add (union), subtract (exclude), intersect, and invert (“xor”). I blogged about this 4 years ago when version 3.35 was about to be released: using GPC made these operations immensely faster, and I saved a lot of time and headache by purchasing a commercial use license for the library and then integrating it into the Paint.NET code base. tl;dr: The algorithms for doing this are nontrivial and rife with special corner cases, and I’d been struggling to find enough sequential time to implement and debug it on my own.

Anyway, the data going into and coming out of GPC is an array of polygons. Each polygon is an array of points, each of which is just a struct containing X and Y as double-precision floating point values. To put it simply, it’s just a System.Windows.Point[][] (I actually use my own geometry primitives nowadays, but that’s another story, and it’s the same exact thing).

Getting this data into GPC from the managed side is easy. You pin every array, and then hand off the pinned pointers to GPC. Since you can’t use the “fixed” expression with a dynamic number of elements, I use GCHandle directly and stuff them all into GCHandle[] arrays for the duration of the native call. This is great because on the managed side I can work with regular ol’ managed arrays, and then send them off to GPC as “native arrays” by pinning them and using the pointers obtained from GCHandle.AddrOfPinnedObject().

Now, here’s the heart breaking part. GPC allocates the output polygon using good ol’ malloc*. So when I get the result back on the managed side, I must copy every single last one so that I can use it as a Point[] (a managed array). This ends up burning a lot of CPU time, and can cause virtual address space claustrophobia on 32-bit/x86 systems when working with complex selections (e.g. Magic Wand), as you must have enough memory for 2 copies of the result while you’re doing the copying. (Or you could free each native array after you copy it into a managed array, but that’s an optimization for another day, and isn’t as straightforward as you’d think because freeing the native memory requires another P/Invoke, and those add up, and so it might not actually be an optimization.)

But wait, there’s another way! Since the code for GPC is part of my build, I can modify it. So I added an extra parameter called gpc_vertex_calloc:

    typedef (gpc_vertex *)(__stdcall * gpc_vertex_calloc_fn)(int count);

    GPC_DLL_EXPORT(
    void, gpc_polygon_clip)(
        gpc_op                set_operation,
        gpc_polygon          *subject_polygon,
        gpc_polygon          *clip_polygon,
        // result_polygon holds the arrays of gpc_vertex, aka System.Windows.Point)
        gpc_polygon          *result_polygon, 
        gpc_vertex_calloc_fn  gpc_vertex_calloc);

("gpc_vertex” is GPC’s struct that has the same layout as System.Windows.Point: X and Y, defined as a double.)

In short, I’ve changed GPC so that is uses an external allocator by passing in a function pointer it should use instead of malloc. And now if I want I can have it use malloc, HeapAlloc, VirtualAlloc, or even the secret sauce detailed below.

On the managed side, the interop delegate definition for gpc_vertex_calloc_fn gets defined as such:

    [UnmanagedFunctionPointer(CallingConvention.StdCall)]
    public delegate IntPtr gpc_vertex_calloc_fn(int count);

And gpc_polygon_clip’s interop defintion is like so:

    [DllImport(“PaintDotNet.SystemLayer.Native.x86.dll”, CallingConvention = CallingConvention.StdCall)]
    public static extern void gpc_polygon_clip(
        [In] NativeConstants.gpc_op set_operation,
        [In] ref NativeStructs.gpc_polygon subject_polygon,
        [In] ref NativeStructs.gpc_polygon clip_polygon,
        [In, Out] ref NativeStructs.gpc_polygon result_polygon,
        [In] [MarshalAs(UnmanagedType.FunctionPtr)] NativeDelegates.gpc_vertex_calloc_fn gpc_vertex_calloc);

So, we’re half way there, and now we need to implement the allocator on the managed side.

    internal unsafe sealed class PinnedManagedArrayAllocator<T>
        : Disposable
          where T : struct
    {
        private Dictionary<IntPtr, T[]> pbArrayToArray;
        private Dictionary<IntPtr, GCHandle> pbArrayToGCHandle;

        public PinnedManagedArrayAllocator()
        {
            this.pbArrayToArray = new Dictionary<IntPtr, T[]>();
            this.pbArrayToGCHandle = new Dictionary<IntPtr, GCHandle>();
        }
 
        // (Finalizer is already implemented by the base class (Disposable))

        protected override void Dispose(bool disposing)
        {
            if (this.pbArrayToGCHandle != null)
            {
                foreach (GCHandle gcHandle in this.pbArrayToGCHandle.Values)
                {
                    gcHandle.Free();
                }

                this.pbArrayToGCHandle = null;
            }

            this.pbArrayToArray = null;

            base.Dispose(disposing);
        }

        // Pass a delegate to this method for “gpc_vertex_calloc_fn”. Don’t forget to use GC.KeepAlive() on the delegate!
        public IntPtr AllocateArray(int count)
        {
            T[] array = new T[count];
            GCHandle gcHandle = GCHandle.Alloc(array, GCHandleType.Pinned);
            IntPtr pbArray = gcHandle.AddrOfPinnedObject();
            this.pbArrayToArray.Add(pbArray, array);
            this.pbArrayToGCHandle.Add(pbArray, gcHandle);
            return pbArray;
        }

        // This is what you would use instead of, e.g. Marshal.Copy()
        public T[] GetManagedArray(IntPtr pbArray)
        {
            return this.pbArrayToArray[pbArray];
        }
    }

(“Disposable” is a base class which implements, you guessed it, IDisposable, while also ensuring that Dispose(bool) never gets called more than once. This is important for other places where thread safety is very important. This class is specifically not thread safe, but it should be reasonably easy to make it so.)

And that’s it! Well, almost. I’m omitting the guts of the interop code but nothing that should inhibit comprehension of this part of it. Also, the above code is not hardened for error cases, and should not be used as-is for anything running on a server or in a shared process. Oh, and I just noticed that my Dispose() method has an incorrect implementation, whereby it shouldn’t be using this.pbArrayToGCHandle, specifically it shouldn’t be foreach-ing on it, and should instead wrap that in its own IDisposable-implementing class … exercise for the reader? Or I can post a fix later if someone wants it.

After I’ve called gpc_polygon_clip, instead of copying all the arrays using something like System.Runtime.InteropServices.Marshal.Copy(), I just use GetManagedArray() and pass in the pointer that GPC retrieved from its gpc_vertex_calloc_fn, aka AllocateArray(). When I’m done, I dispose the PinnedManagedArrayAllocator and it unpins all the managed arrays. And this is much faster than making copies of everything.

Now, this isn’t the exact code I’m using. I’ve un-generalized it in the real code so I can allocate all of the arrays at once instead of incurring potentially hundreds of managed <—> native transitions for each allocation. The above implementation also doesn’t have a “FreeArray” method; I had one, but I ended up not needing it, so I removed it.

So the next time you find yourself calling into native code which either 1) allows you to specify an external allocator, or 2) is part of your build, and that 3) involves lots of data and thus lots of copying and wasted CPU time, you might just consider using the tactic above. Your users will thank you.

Lastly, I apologize for my blog’s poor code formatting.

Legal: I hereby place the code in this blog post into the public domain for anyone to do whatever they want with. Attribution is not required, but I certainly appreciate if you send me an e-mail or post a comment and let me know it was useful for you.

* Actually in Paint.NET 3.5, I changed it to use HeapAlloc(). This way I can get an exception raised when it runs out of memory, instead of corrupt results. This does happen on 32-bit/x86, especially when using the Magic Wand on large images.