Validating .NET/RESX translations easily with LINQ-to-XML

One of the tools I finally decided to sit down and write was a "ResXCheck" utility. RESX files hold the string resources for many .NET programs, including Paint.NET. At build time, these are compiled into a binary format and then stored in a DLL, or a RESOURCES file (Paint.NET uses the latter). The binary format is simply more efficient to access at run-time, and resgen.exe can even be used to convert between the two formats.

Essentially, a RESX is an XML file with a big list of name/value pairs. I will refer to these as string names and string values. The XML looks like this:

<data name="GradientTool.HelpText">
<value>Click and drag to start drawing. Holding shift constrains the angle. Right mouse button reverses colors.</value>
</data>
<data name="GradientTool.HelpText.WhileAdjusting.Format">
<value>Offset: {0}{1} x {2}{3}, Length: {4} {5}, Angle: {6}°. Holding other mouse button will move both nubs.</value>
</data>

The RESX files for translations are in the same format and should have all the same string names. The <value> elements should have different text in them, however (it should be translated!). At runtime, both the translated and original RESOURCES files are needed; the latter is used as a "fallback" in case a string is not defined in the translation. Whether a missing string counts as an error is up to your translation process and resource loader code. Sometimes you want the fallback behavior, such as when you have a base "EN" (English) translation and some strings must vary for "EN-US" (U.S. English) or "EN-GB" (British English). You can store common string definitions in "EN", although it is usually better to completely duplicate the content and use a tool to maintain the duplication.
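
To make the fallback idea concrete, here is a minimal sketch of a lookup that tries the most specific string table first and then falls back. It is not how Paint.NET or the .NET ResourceManager is implemented; the Lookup function and the sample strings are made up for illustration, and it is written as a C# 9+ top-level program to keep it short.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: search the tables in order, most specific first.
string Lookup(string name, params IDictionary<string, string>[] tablesMostSpecificFirst)
{
    foreach (var table in tablesMostSpecificFirst)
    {
        if (table.TryGetValue(name, out var found))
        {
            return found;
        }
    }

    // Not defined anywhere. Whether that is an error is up to the caller.
    return null;
}

// Made-up sample data: "EN" holds the common strings, "EN-GB" only the ones that differ.
var en = new Dictionary<string, string> { ["Color"] = "Color", ["Open"] = "Open" };
var enGB = new Dictionary<string, string> { ["Color"] = "Colour" };

Console.WriteLine(Lookup("Color", enGB, en)); // defined in EN-GB, so that text wins
Console.WriteLine(Lookup("Open", enGB, en));  // missing from EN-GB, falls back to EN
```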

Having worked with RESX files on Paint.NET and elsewhere, the following problems come up:

1. You could use the same string name twice. The RESX compiler will simply grab one of them, either the first or the last (I can’t remember). This is a problem if you go to update a string later and you change the wrong one. Then, your changes might not show up in the main program and you won’t know why. And you’ll have a heck of a time figuring it out.
2. A translation could be missing some string names. If this happens, generally the fallback text (usually English) will show up. That is probably not the desired behavior, although in Paint.NET there are a few places where it’s okay. Translating a technical acronym for a pixel format, such as "A8R8B8G8", isn’t really necessary.
3. A translation could have extra string names defined. This is likely to happen if strings are removed from the original RESX, but the translation hasn’t been updated yet. This will not cause any errors; it is just extra cruft that can accumulate if you don’t pay attention (most professional translation teams have tools which handle this case automatically).
4. A string value could have "malformed" formatting tags. In the XML listed above, the second string has formatting tags {0} through {6}. These represent values which must be supplied at runtime by the application. There are two hazards here. One is that you could have formatting that String.Format(…) doesn’t like, such as having a { without a closing }, or vice versa. The other hazard is a translation that defines extra formatting tags, for example if a {7} were added above. Then your application will crash when it goes to apply formatting to that string. This is mostly a problem when strings have not yet been updated for a newer version. The translation may define fewer formatting tags, and this may or may not be an error. You may have a formatting tag that represents a piece of text that is not necessary to display in a particular translation.
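
The two formatting-tag hazards can be sketched roughly as follows. This is not the ResXCheck code; GetTagIndices and FormatsCleanly are hypothetical helpers, the regular expression is a simplification of the real composite-format grammar, and the "translated" string is just the original with a {7} appended. It is written as a C# 9+ top-level program.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Collects the argument indices used by a format string, e.g. "{0} x {2:N1}" gives 0 and 2.
// Simplification: escaped braces ("{{" and "}}") are stripped up front, not parsed properly.
SortedSet<int> GetTagIndices(string text)
{
    string unescaped = text.Replace("{{", "").Replace("}}", "");
    var indices = Regex.Matches(unescaped, @"\{(\d+)[^{}]*\}")
                       .Cast<Match>()
                       .Select(m => int.Parse(m.Groups[1].Value));
    return new SortedSet<int>(indices);
}

// Covers both hazards at once: String.Format throws FormatException for unbalanced
// braces and for any tag index that is >= the number of arguments supplied.
bool FormatsCleanly(string text, int argumentCount)
{
    try
    {
        string.Format(text, new object[argumentCount]);
        return true;
    }
    catch (FormatException)
    {
        return false;
    }
}

string original = "Offset: {0}{1} x {2}{3}, Length: {4} {5}, Angle: {6}°.";
string translated = original + " {7}"; // stand-in for a translation with an extra tag

int expectedArgs = GetTagIndices(original).Max + 1; // the application supplies 7 values
var extraTags = GetTagIndices(translated).Except(GetTagIndices(original));

Console.WriteLine(FormatsCleanly(translated, expectedArgs)); // False
Console.WriteLine(string.Join(",", extraTags));              // 7
```

Comparing the index sets tells you which tags are extra or missing, while the String.Format probe catches anything the formatter itself would reject at runtime.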

Luckily, all of these can be checked for with some simple automation, which is what I have done with ResXCheck. I will be including its code in the next Paint.NET source code drop (for v3.30), and plan to tag it as "public domain" (just the utility, not Paint.NET itself). In the meantime, here’s a little utility function that can help you load a RESX and convert it to an IEnumerable of type KeyValuePair<string, string>. Duplicate string names are not removed, which is important for being able to check #1 above.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

// Given a file name for a RESX, returns a non-consolidated list of string name/value pairs.
IEnumerable<KeyValuePair<string, string>> FromResX(string resxFileName)
{
    XDocument xDoc = XDocument.Load(resxFileName);

    var query = from xe in xDoc.XPathSelectElements("/root/data")
                let attributes = xe.Attributes()
                let name = (from attribute in attributes
                            where attribute.Name.LocalName == "name"
                            select attribute.Value)
                let elements = xe.Elements()
                let value = (from element in elements
                             where element.Name.LocalName == "value"
                             select element.Value)
                select new KeyValuePair<string, string>(name.First(), value.First());

    return query;
}
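
With the name/value pairs in hand, the missing and extra string names come down to two set differences. The following is a rough sketch rather than the ResXCheck implementation: NamesOnlyIn is a hypothetical helper, and the sample data is made up and kept in memory so the snippet runs without any RESX files (C# 9+ top-level program).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Names present in 'expected' but absent from 'actual'.
// Except() is a set difference, so a duplicated name is reported only once.
IEnumerable<string> NamesOnlyIn(
    IEnumerable<KeyValuePair<string, string>> expected,
    IEnumerable<KeyValuePair<string, string>> actual)
{
    return expected.Select(kv => kv.Key).Except(actual.Select(kv => kv.Key));
}

// Made-up sample data standing in for FromResX(...) results.
var originalPairs = new[]
{
    new KeyValuePair<string, string>("File.Open", "Open"),
    new KeyValuePair<string, string>("File.Save", "Save"),
};
var translationPairs = new[]
{
    new KeyValuePair<string, string>("File.Open", "Öffnen"),
    new KeyValuePair<string, string>("File.Print", "Drucken"),
};

// Missing from the translation: File.Save
Console.WriteLine(string.Join(", ", NamesOnlyIn(originalPairs, translationPairs)));
// Extra in the translation: File.Print
Console.WriteLine(string.Join(", ", NamesOnlyIn(translationPairs, originalPairs)));
```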

To check for duplicates, a simple query such as the following is all that’s needed:

var resx = FromResX("strings.resx");

var dupeItems = resx.ToLookup(kv => kv.Key, kv => kv.Value)     // 1
                    .Where(item => item.Take(2).Count() > 1)    // 2
                    .SelectMany(item => item.Select(val =>      // 3
                        new KeyValuePair<string, string>(item.Key, val)));

// 1: converts from IEnumerable<KeyValuePair<string, string>> to ILookup<string, string>, which is an IEnumerable<IGrouping<string, string>>: essentially a list of keys, each of which has a nested list of values
// 2: finds any key which has 2 or more values in it. The "Take(2)" is an optimization
// 3: converts back to a list of key,value pairs (probably not necessary if you use the "T-SQL" style syntax)

You can then do a foreach() over this virtual list and print out the key,value pairs. I could have written that query using the more succinct "T-SQL" style query syntax, but I hadn’t yet learned it when I wrote that part of the code. ResXCheck was a little project I took on to force myself to learn more about LINQ. Surprisingly, it only took about 2 minutes to learn the more compact query syntax.
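
For comparison, here is one way the same duplicate check might look in query syntax. This is a sketch, not the code from ResXCheck; FindDuplicates and the sample strings are made up for the example, and it is written as a C# 9+ top-level program.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Returns every name/value pair whose string name occurs more than once.
IEnumerable<KeyValuePair<string, string>> FindDuplicates(
    IEnumerable<KeyValuePair<string, string>> resx)
{
    return from kv in resx
           group kv.Value by kv.Key into g   // one group per string name
           where g.Skip(1).Any()             // keep names with 2 or more values
           from val in g                     // flatten back to name/value pairs
           select new KeyValuePair<string, string>(g.Key, val);
}

// Made-up sample data standing in for FromResX("strings.resx").
var sample = new[]
{
    new KeyValuePair<string, string>("Menu.File", "File"),
    new KeyValuePair<string, string>("Menu.Edit", "Edit"),
    new KeyValuePair<string, string>("Menu.File", "Files"),
};

foreach (var dupe in FindDuplicates(sample))
{
    Console.WriteLine($"{dupe.Key} = {dupe.Value}");
}
// prints "Menu.File = File" then "Menu.File = Files"
```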

I’m quite happy with LINQ. It’s letting me do some powerful data manipulation with very succinct, expressive code. And it’s very simple! I’ve already found a few mistakes in my RESX files, and they will be easy to fix. This tool will also help volunteer translators who publish their translations on the forum. I know it is hard to validate these things for correctness sometimes, especially for problem #4 listed above.

Oh, and for fun I made the utility parallelize the processing so you can validate "N" translations at the same time. Sadly, on my quad-core box it only dropped the validation time from 560 milliseconds down to 300. If only I had 50 more translations, I could really stress it! 🙂
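
The post does not say how the parallelism was done, so the following is only one possible sketch, using PLINQ's AsParallel() (which shipped with .NET 4, after this was written). CountDuplicateNames is a hypothetical stand-in for the full set of checks, and the translations are made-up in-memory data (C# 9+ top-level program).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical per-translation check: how many string names occur more than once.
int CountDuplicateNames(IEnumerable<KeyValuePair<string, string>> pairs)
{
    return pairs.GroupBy(kv => kv.Key).Count(g => g.Skip(1).Any());
}

// Made-up sample data: translation name -> its name/value pairs.
var translations = new Dictionary<string, KeyValuePair<string, string>[]>
{
    ["de"] = new[]
    {
        new KeyValuePair<string, string>("Menu.File", "Datei"),
        new KeyValuePair<string, string>("Menu.File", "Dateien"),
    },
    ["fr"] = new[]
    {
        new KeyValuePair<string, string>("Menu.File", "Fichier"),
    },
};

// Each translation is validated independently, so the work can be spread across cores.
var results = translations
    .AsParallel()
    .Select(t => new { Name = t.Key, Duplicates = CountDuplicateNames(t.Value) })
    .OrderBy(r => r.Name)
    .ToList();

foreach (var result in results)
{
    Console.WriteLine($"{result.Name}: {result.Duplicates} duplicated name(s)");
}
```

With only a handful of small files, the fixed cost of loading and scheduling tends to dominate, which fits the modest speedup described above.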

Paint.NET Blog
