AST-GREP Learning experience

I've been digging into ast-grep, a tool for searching and re-writing code that takes advantage of the code's syntax. Here's a problem I've been working through and lessons I've learned along the way.

My Problem

When writing unit tests I often end up with very long parameter lists. The AutoFixture framework for C# uses attributes, which make parameters with already long types even longer. Add long test method names that follow unit testing conventions and you get some very long method definition lines:

public async Task UnitUnderTest_Scenario_ExpectedResult([Frozen] IApiService apiService, [Frozen] ITimeService timeService, [Frozen] ILogicService logicService, SubjectUnderTest sut)
{}

These are a pain to read, and individual parameters are a pain to adjust. Ideally it would be rewritten as:

public async Task UnitUnderTest_Scenario_ExpectedResult(
    [Frozen] IApiService apiService,
    [Frozen] ITimeService timeService,
    [Frozen] ILogicService logicService,
    SubjectUnderTest sut
)
{}

This change can easily be accomplished with LSP tools triggered from an editor. But it would be nice not to rely on every developer on a project remembering to do this, and to take care of it automatically instead.

Regular Expressions

Regular expressions (regex) can do this, but it would require a lot of work to handle every scenario. Something like s/, /,\n/g would change commas to be followed by new lines if all we had was what is above. But what about commas elsewhere? It would be a trick to limit the matches to parameter lists in a method definition. We could make sure everything is in a set of parentheses. But that would also pick up arguments in method calls. Plus it would pick up commas in type arguments for generics (GenericType<TypeA, TypeB>). Then we'd need special cases if we want a newline before the first parameter. And we'd need to account for the possibility of some parameters being wrapped and others not. We can't assume the parameters are either already formatted correctly or all sitting on a single line. For example, we'd need to catch the following case as well:

public async Task UnitUnderTest_Scenario_ExpectedResult(
    [Frozen] IApiService apiService,
    [Frozen] ITimeService timeService, [Frozen] ILogicService logicService,
    SubjectUnderTest sut)
{}

It would be great if we could know if text was part of a parameter list or not.
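
To make the regex problem concrete, here is a minimal sketch in Python (not part of my actual tooling) that applies the naive s/, /,\n/g substitution. The function name naive_wrap and the sample C# strings are made up for illustration.

```python
import re

def naive_wrap(source: str) -> str:
    """Put a new line after every ', ' with no awareness of syntax."""
    return re.sub(r", ", ",\n", source)

# Hypothetical C# snippets: a method signature with a generic type,
# and an ordinary method call that should be left alone.
signature = "void Run(Dictionary<string, int> map, int count)"
call = "Run(map, 3);"

wrapped_signature = naive_wrap(signature)
wrapped_call = naive_wrap(call)

print(wrapped_signature)
print(wrapped_call)
```

Both the comma inside Dictionary<string, int> and the comma in the method call get a new line, which is exactly what we don't want.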

ast-grep

ast-grep is a search and replace tool that takes advantage of Abstract Syntax Trees (ASTs) for code. An AST represents the syntactic structure of the code. For example, it marks function definitions, parameter lists, argument lists, if blocks, etc., which allows limiting matches to particular kinds of syntax nodes. It is structured as a tree, with nodes nested under others. For example, our first code snippet's AST looks like the following, regardless of whitespace.

(compilation_unit ; [0, 0] - [3, 0]
  (global_statement ; [0, 0] - [1, 2]
    (local_function_statement ; [0, 0] - [1, 2]
      (modifier) ; [0, 0] - [0, 6]
      (modifier) ; [0, 7] - [0, 12]
      type: (identifier) ; [0, 13] - [0, 17]
      name: (identifier) ; [0, 18] - [0, 55]
      parameters: (parameter_list ; [0, 55] - [0, 182]
        (parameter ; [0, 56] - [0, 87]
          (attribute_list ; [0, 56] - [0, 64]
            (attribute ; [0, 57] - [0, 63]
              name: (identifier))) ; [0, 57] - [0, 63]
          type: (identifier) ; [0, 65] - [0, 76]
          name: (identifier)) ; [0, 77] - [0, 87]
        (parameter ; [0, 89] - [0, 122]
          (attribute_list ; [0, 89] - [0, 97]
            (attribute ; [0, 90] - [0, 96]
              name: (identifier))) ; [0, 90] - [0, 96]
          type: (identifier) ; [0, 98] - [0, 110]
          name: (identifier)) ; [0, 111] - [0, 122]
        (parameter ; [0, 124] - [0, 159]
          (attribute_list ; [0, 124] - [0, 132]
            (attribute ; [0, 125] - [0, 131]
              name: (identifier))) ; [0, 125] - [0, 131]
          type: (identifier) ; [0, 133] - [0, 146]
          name: (identifier)) ; [0, 147] - [0, 159]
        (parameter ; [0, 161] - [0, 181]
          type: (identifier) ; [0, 161] - [0, 177]
          name: (identifier))) ; [0, 178] - [0, 181]
      body: (block)))) ; [1, 0] - [1, 2]

Seeing that what we care about is a parameter_list (in the field named "parameters"), we could rewrite our naive regular expression above as something like this:

rule:
  pattern: ","
  inside:
    kind: parameter_list
fix: ",\n"

ast-grep Playground

This would match commas only inside of a parameter list and append new lines after them. This solves the problem of dealing with commas in function call argument lists and elsewhere. It is even smart enough to only replace commas inside the parameter list directly, and not within a parameter's generic type argument list (we'll bump into this problem later). But it doesn't account for wrapping the first parameter.

Transforms

In the above example we did a simple replacement. Commas inside of a parameter_list are replaced by commas followed by a new line. Another option is to use transforms. Transforms allow replacements on meta-variables (none were used in the first example).

rule:
  pattern: $PARAMS
  kind: parameter_list
transform:
  NEW:
    replace:
      source: $PARAMS
      replace: (?<MATCH>[\(,])
      by: "$MATCH\n"
fix: $NEW

ast-grep Playground

Here we match on the parameter_list instead of the commas within it. That list is saved into a meta variable, $PARAMS. We then use a transform to apply changes to what was matched by our rule, and put its result into $NEW. This transform uses replace: to do a regex replacement. Its source is $PARAMS, taken from the rule pattern. With replace: (a field of the first replace:) we specify a regex to find matches within the source. This regex matches either a ( or a , using the character class [\(,]. It saves the match to another variable, $MATCH, by putting that pattern in (?<MATCH>...). Then the replacement is specified as "$MATCH\n", putting a new line after each opening paren and comma.

Note that this replacement applies to each match, not just the first one. This allows us to replace all the commas and the opening paren. Also note that putting the by: string in double quotes is important here. If single quotes or no quotes were used, \n would be inserted as is (a backslash followed by n) instead of being treated as a new line.

Finally $NEW, the result of the transform, is used as our final replacement via fix:.

If you looked at the result in the playground you may have noticed this has a flaw. It adds new lines after commas in generic type arguments (Generic<A, B>). The first example missed this problem as it operated on commas directly inside the parameter list node. But here we're applying our replacement to the string composing the parameter list. Another flaw is that the first parameter is not indented like the remaining parameters.
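
The same text-level replacement can be sketched in Python to see both flaws outside the playground. This is only an approximation: Python's re module stands in for the regex engine ast-grep uses, and it spells the named group (?P<MATCH>...) rather than (?<MATCH>...). The function name and the sample parameter list are hypothetical.

```python
import re

def wrap_after_paren_and_comma(params: str) -> str:
    """Append a new line after every '(' or ',' found in the text."""
    return re.sub(
        r"(?P<MATCH>[(,])",
        lambda m: m.group("MATCH") + "\n",  # plays the role of by: "$MATCH\n"
        params,
    )

# Hypothetical parameter list text containing a generic type.
params = "(Dictionary<string, int> map, int count)"
result = wrap_after_paren_and_comma(params)
print(result)
```

Because the replacement only sees a string, the comma inside Dictionary<string, int> is wrapped too, and the first parameter ends up with no indentation while the others keep the single space that followed their comma.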

Chaining Transforms

We can solve the problem of the first parameter not being indented by breaking up our replacement to handle commas and opening parens separately, along with their indentation. Here the transforms are chained with $PARAMS as the source feeding into $NEW and $NEW as the source feeding into $NEWER which is used in our final fix.

rule:
  pattern: $PARAMS
  kind: parameter_list
transform:
  NEW:
    replace:
      source: $PARAMS
      replace: \,
      by: ",\n   "
  NEWER:
    replace:
      source: $NEW
      replace: \(
      by: "(\n    "
fix: $NEWER

ast-grep Playground

You'll note that there is one more space for the opening paren replacement. That is because all our commas have a space after them, which was left in place and ends up after the new line. Of course we might not always have a single space following the comma. Some awful programmer might not put spaces after their commas. Or they might go crazy with multiple spaces. That would ruin all our indentation. We could adjust our first replace: regex to \,\s* to include zero or more whitespace characters after the comma so that they are replaced as well. With that change the by: string would need the full four spaces of indentation.
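
Here is a rough Python sketch of the chained replacement with the \,\s* adjustment applied. It only mimics the two transform steps on plain text; the names wrap_params and INDENT are made up, and the sample input deliberately has inconsistent spacing after its commas.

```python
import re

INDENT = "    "  # four spaces, used for every parameter

def wrap_params(params: str) -> str:
    # Step 1 (like NEW): a comma plus any following whitespace
    # becomes comma, new line, indent.
    step1 = re.sub(r",\s*", ",\n" + INDENT, params)
    # Step 2 (like NEWER): the opening paren becomes paren, new line, indent.
    step2 = re.sub(r"\(", "(\n" + INDENT, step1)
    return step2

# Hypothetical input: no space after one comma, three after another.
messy = "(int a,int b,   int c)"
wrapped = wrap_params(messy)
print(wrapped)
```

Since the whitespace after each comma is now consumed by the match, every parameter gets the same indentation regardless of how the original was spaced. The generic type argument problem is still there, though.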

Rewriters

Let's take a look at another option for replacements: rewriters. Essentially, with rewriters we can have named rules and fixes that are applied through a transform. The advantage is that multiple rewriters can be applied, each targeting different nodes in the match. For this example, though, only one rewriter is used.

rewriters:
- id: param-wrap
  rule:
    kind: parameter
    pattern: $P
  fix:
    template: $P
    expandEnd: { regex: ',' }
rule:
  pattern: $M $T $F($$$PARAMS)
transform:
  REPARAM:
    rewrite:
      rewriters: [param-wrap]
      source: $$$PARAMS
      joinBy: ",\n"
fix: |
  $M $T $F(
    $REPARAM)

ast-grep Playground

Here we start off by defining one rewriter with an id param-wrap by which we'll reference it later. It matches any parameter, saved in variable $P. In the fix we replace the parameter with itself, $P. The commas are not a part of the parameter nodes, so we expand the match to include commas with expandEnd:. This allows removing the commas so that later on we can re-add the comma followed by a new line. If we didn't do this the added new line would come after the parameter but before the comma. Since expandEnd: is a field within fix: we can't specify the fix itself as the value to fix: as done in the first example. template: is the field of fix: that allows specifying the fix string when other fields are needed.

The rewriter is then applied by using a transform: with rewrite: and the rewriter specified by its id. Multiple rewriters could have been specified in the rewriters: list. The rewrite is limited by the source: to apply it only on $$$PARAMS. Remember, our rewriter is just removing the commas. Now they are added back in with a new line following by joining together the parameters with ,\n.
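
As a mental model only (this is not how ast-grep is implemented), the rewrite step can be sketched in Python as "drop the trailing comma from each expanded match, then join". The function names and the list of matches are hypothetical; in ast-grep the matches come from parameter nodes in the tree, not from a hand-written list.

```python
import re

def param_wrap(expanded_match: str) -> str:
    """Mimic the rewriter: the match was expanded to swallow a trailing
    comma, and the fix keeps only the parameter itself."""
    return re.sub(r",$", "", expanded_match)

def rewrite(expanded_matches, join_by=",\n"):
    """Mimic rewrite: with joinBy: rewrite each match, then join them."""
    return join_by.join(param_wrap(m) for m in expanded_matches)

# Hypothetical expanded matches: each parameter plus its trailing comma, if any.
matches = [
    "[Frozen] IApiService apiService,",
    "Dictionary<string, int> map,",
    "SubjectUnderTest sut",
]
reparam = rewrite(matches)
print(reparam)
```

Because the work happens per parameter rather than on the raw text, the comma inside Dictionary<string, int> is never touched.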

In our rule pattern you see we have $M $T and $F(...). These match modifiers (public, private, static) with $M, the return type with $T, and the function name with $F. With this approach of just using a pattern to do our matching, specifying $M and $T was necessary to differentiate our pattern from a function call. But it has the problem that this rule can fail on methods with multiple modifiers: public static void MyFunc(...). Something to solve for another day.

Multi-Meta Variables

When a variable is preceded by three dollar signs, like $$$V, it is called a multi-meta variable. A multi-meta variable can match zero or more AST nodes. In the previous example it was necessary to use the multi-meta variable $$$PARAMS as the method's parentheses are a part of the parameter_list. Since in our pattern the parentheses are separate from our meta variable, a single meta variable can't match the parameter_list node, but only the parameters within it. If we had used $PARAMS, a single meta variable, it would match if there was only one parameter in the list, but not more or less. We use $$$PARAMS to match multiple parameters into a single variable.

In the next example I thought I needed to use a multi-meta variable to match multiple parameters. Since I was actually matching a single node, the parameter_list, it was unnecessary. But it shows an important aspect of multi-meta variables.

  rewriters:
  - id: param-wrap
    rule:
      kind: parameter
      pattern: $P
    fix:
      template: $P
      expandEnd: { regex: ',' }
  rule:
+   pattern: $$$PARAMS
+   kind: parameter_list
  transform:
    REPARAM:
      rewrite:
        rewriters: [param-wrap]
        source: $$$PARAMS
        joinBy: ",\n"
  fix: |
+     ($REPARAM)

ast-grep Playground

This is pretty similar to the previous example. But now we've been able to drop the trouble of proving our list is a parameter_list inside a function by matching the modifier, type, and function name nodes. Now the rule: just matches on kind: parameter_list with a pattern to save the match into a variable. But if you try out this rule on the playground you'll see the parameter list is just deleted instead of replaced.

The reason for this is $$$PARAMS is being used as a standalone pattern that only captures one node. But multi meta variables are intended to capture a list of nodes. To go along with that there is a bug where the multi meta variable $$$PARAMS is captured in the single meta variable $PARAMS (which I'll note, HerringtonDarkholme, the author of ast-grep, was quick to identify). But the fix is simple in this case. A multi meta variable is not needed, so just use a regular meta variable.

  rewriters:
  - id: param-wrap
    rule:
      kind: parameter
      pattern: $P
    fix:
      template: $P
      expandEnd: { regex: ',' }
  rule:
+   pattern: $PARAMS
    kind: parameter_list
  transform:
    REPARAM:
      rewrite:
        rewriters: [param-wrap]
+       source: $PARAMS
        joinBy: ",\n"
  fix: |
      (
        $REPARAM
      )

ast-grep Playground

Now we've got new lines after all our commas, excluding those that aren't separating parameters in the parameter list. It's working just the way we want it. But we could make some tweaks to apply it just when we want it.

Ignore Short Lists

The whole reason we wanted to wrap our parameter lists is they get very long and hard to read. But what if we only have a few parameters? Maybe it is fine to keep them on the same line. Or worse, what if there are no parameters? The last rule inserts new lines which are completely unnecessary.

Let's modify our rule to only match parameter lists with at least 3 parameters.

  rewriters:
  - id: param-wrap
    rule:
      kind: parameter
      pattern: $P
    fix:
      template: $P
      expandEnd: { regex: ',' }
  rule:
    pattern: $PARAMS
    kind: parameter_list
+   has:
+     nthChild: 3
  transform:
    REPARAM:
      rewrite:
        rewriters: [param-wrap]
        source: $PARAMS
        joinBy: ",\n"
  fix: |-
      (
        $REPARAM
      )

ast-grep Playground

The change is small. Under rule: we've added has: nthChild: 3. Examples show nthChild used to match a particular sub-node. But here we use it to say our parameter_list has a 3rd parameter. Now this rule will skip any parameter list with 2 or fewer parameters.

But maybe we care less about the number of parameters, and more about the overall length of the parameter_list. If many parameters take up little space we don't need to wrap them. But a few parameters that take up a lot of space with attributes or long types, we want to wrap. We can achieve that by replacing has: nthChild: 3 with the following:

  regex: ".{25,}"

ast-grep Playground

This is a simple regex that says if our parameter_list has 25 or more characters, apply the rule.
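
One detail worth checking is how . treats new lines. The sketch below uses Python's re module, where . does not match a new line by default, so .{25,} really means "25 or more characters on a single line". I'm assuming, but have not verified, that the engine behind ast-grep behaves the same way. The function name is_long and the sample strings are made up.

```python
import re

LONG_LIST = re.compile(r".{25,}")

def is_long(parameter_list_text: str) -> bool:
    """True if some single line of the text has 25 or more characters."""
    return LONG_LIST.search(parameter_list_text) is not None

short_list = "(int a, int b)"
long_list = "([Frozen] IApiService apiService, SubjectUnderTest sut)"
# Already wrapped: more than 25 characters in total, but every line is short.
wrapped_list = "(\n    int a,\n    int b,\n    int c,\n    int d\n)"

print(is_long(short_list), is_long(long_list), is_long(wrapped_list))
```

Under that assumption, a list that is already wrapped into short lines is skipped even though it is longer than 25 characters overall, which is a convenient side effect here.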

Closing Thoughts

ast-grep is a great tool for modifying code programmatically, constrained by the underlying syntax. It allows us to ignore variations in formatting by utilizing the syntax nodes. But it requires dealing with both the abstract syntax tree and the plain text code. With this, there are many different options for matching and modifying code. This makes writing rules complex and challenging. But I believe it will pay off. I was able to take my end solution here and apply it to a similar problem with argument_attribute_lists very quickly. But I'm still far from having mastered ast-grep. Another problem I took on proved to have its own set of challenges I've yet to work through.