Techniques

Renovation tools routinely comprise the following main components: preprocessors, parsers, analyzers, transformers, visualizers, pretty printers, and postprocessors. In many cases, language-parameterized (or generic) tools are available to construct these components.

Figure 1 depicts a grammar-centric approach to enabling rapid development of renovation tools. Arrow length indicates the degree of effort involved (longer arrows imply more effort). As the figure shows, once you have a generic core and a grammar, it does not take much effort to construct parsers, tree walkers, pretty printers, and so on. Although these components depend on a particular language, their implementation uses generic language technology: a parser is produced using a parser generator, a pretty printer is created using a formatter generator, and tree walkers for analysis or modification are generated similarly. All these generators rely heavily on the grammar. Once you have the grammar and the relevant generators, you can rapidly set up this core for developing software renovation tools. The longest arrow in Figure 1 represents the current situation: creating those grammars takes a lot of effort.
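To make the generic core concrete, here is a minimal sketch of grammar-driven tooling using the Python library Lark; the toy grammar and the choice of Lark are ours, purely for illustration, and not the technology used in the projects described here.

    from lark import Lark, Transformer

    # The grammar is the single artifact everything else is generated from.
    grammar = r"""
        start: stmt+
        stmt: NAME "=" NUMBER ";"
        NAME: /[a-z]+/
        NUMBER: /[0-9]+/
        %import common.WS
        %ignore WS
    """

    parser = Lark(grammar)        # parser, generated from the grammar

    class CountAssignments(Transformer):  # tree walker, also grammar-driven
        def __init__(self):
            super().__init__()
            self.count = 0
        def stmt(self, children):
            self.count += 1
            return children

    tree = parser.parse("x = 1; y = 2;")
    walker = CountAssignments()
    walker.transform(tree)
    print(walker.count)           # 2 assignments found

A pretty printer can be generated from the same grammar in the same spirit, which is why the grammar, once recovered, pays off across all the components listed above.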

 

Grammar stealing covers almost all languages


The following exhaustive case distinction shows that the approach covers virtually all languages. Because the software to be converted already exists, it can be compiled or interpreted, so a compiler or interpreter for it must exist.

First, enter the Compiler Sources diamond: either the compiler's source code is available to you or it is not. If it is, you just have to find the part that turns the text into an intermediate form; that part contains the grammar in some form. You find it by lexically searching the compiler source code for the language's keywords. Compiler constructors implement a parser in one of three ways: they hardcode it, use a parser generator, or do both (in a complex multilanguage compiler, for instance).
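The lexical search itself requires almost no machinery. The following sketch (keyword list and directory layout invented for illustration) ranks compiler source files by keyword density, which quickly points at the parser or its grammar tables:

    import os, re

    KEYWORDS = ["PROCEDURE", "PERFORM", "EVALUATE"]  # hypothetical keywords

    def grammar_candidates(compiler_src_dir):
        """Rank compiler source files by how many language keywords they mention."""
        pattern = re.compile("|".join(KEYWORDS))
        hits = {}
        for root, _, files in os.walk(compiler_src_dir):
            for name in files:
                path = os.path.join(root, name)
                try:
                    text = open(path, errors="replace").read()
                except OSError:
                    continue
                count = len(pattern.findall(text))
                if count:
                    hits[path] = count
        return sorted(hits, key=hits.get, reverse=True)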

Figure 2 shows the first two cases; the third is just a combination. If you start with a hard-coded parser, you must reverse-engineer the grammar from the handwritten code. Fortunately, the comments in such code often include BNF (Backus-Naur Form) rules indicating what the grammar comprises. Moreover, because compiler construction is well understood (there is a known reference architecture), compilers are often implemented with well-known algorithms, such as recursive descent. So, the quality of a hard-coded parser implementation is usually good, in which case you can easily recover the grammar from the code, the comments, or both. In only one case, the Perl language, was the quality of the code we worked with insufficient to recover the grammar.

If the parser is not hard-coded, it is generated (the BNF branch in Figure 2), and some BNF description of the language must reside in the compiler source code. So, with a simple tool that parses the BNF notation itself, we can recognize the BNF of the language within the compiler sources and extract it.
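Such a tool can be genuinely simple. As a sketch, the following parser handles one invented BNF dialect, with rules of the form lhs ::= alternatives, one per line; each compiler's actual dialect would need its own small variant:

    import re

    RULE = re.compile(r"^\s*([\w-]+)\s*::=\s*(.*)$")

    def parse_bnf(text):
        """Parse a toy BNF dialect into a grammar table: lhs -> alternatives."""
        grammar = {}
        for line in text.splitlines():
            match = RULE.match(line)
            if not match:
                continue  # skip semantic actions, comments, and other noise
            lhs, rhs = match.groups()
            alternatives = [alt.split() for alt in rhs.split("|")]
            grammar.setdefault(lhs, []).extend(alternatives)
        return grammar

    print(parse_bnf("plex-program ::= program-header statement-row 'END' 'PROGRAM' ';'"))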

When the compiler source code is not accessible, we enter the Language Reference Manual diamond in Figure 2: either a reference manual exists or it does not. If one is available, it could be either a compiler-vendor manual or an official language standard, and it explains the language by example, through general rules, or both. If a manual uses general rules, its quality is generally not good: reference manuals and language standards are full of errors. In our experience, however, the myriad errors are repairable. (As an aside, we once failed to recover a grammar from the manual of a proprietary language, but the compiler source code was also available, so that case is covered by the upper half of Figure 2.) As the coverage diagram shows, we have not found low-quality reference manuals containing general rules in cases where we did not have access to the source code. This makes sense: compiler vendors who, for economic reasons, do not give away their compilers' source code must provide accurate and complete documentation to be successful. We found that the quality of those manuals is good enough to recover the grammar. This applies not only to compiler-vendor manuals but also to all kinds of de facto and official language standards.

Unusual languages rarely have high-quality manuals: either no manual exists (for example, when the language is proprietary) or the company has only a few customers. In the proprietary case, the company is using its in-house language and so has access to the source code; in the other case, outsiders can buy the code because its business value is not too high. For instance, when Wang went bankrupt, its key customers bought the source code for its operating system and compilers to create their own platform- and dialect-migration tools. This explains why we know of no low-quality manuals containing general rules. In one case, that of RPG, the manual explains the language through code examples, and general rules are absent. We can examine this case in more detail if we are ever asked for an RPG renovation project involving a large amount of RPG code; we think we can systematically extract RPG's general rules from the code examples.

In addition, because the manual contains code examples, there is a good chance that these examples have been tested against the compiler. This means the manual's formal content could be of much higher quality than you would expect from such documents.

Finally, we must deal with the case in which we have access to neither the compiler sources nor a reference manual. Customers who had bought the source code could still solve their problems using the upper half of Figure 2; but if, for instance, the core developers threw away the sources after their company collapsed, no grammar can be recovered. Summarizing, our coverage diagram shows that you can recover virtually any grammar, whether or not you have the compiler sources.

 

Grammar Stealing in Practice


Grammar stealing has been applied successfully to Java, PL/I, Ericsson PLEX, C++, Ada 95, VS Cobol II, AT&T SDL, SWIFT messages, and more. For instance, let us focus on PLEX (Programming Language for Exchanges), a proprietary, nontrivial, real-time embedded-system language (for which the compiler source code was accessible to us) and, at the other end of the spectrum, VS Cobol II, a well-known business language (for which no compiler code was available). Both languages are used in business-critical systems: the AXE 10 public branch exchange uses PLEX, and numerous IBM mainframe systems run VS Cobol II. These two languages represent the two main branches in Figure 2.

Our approach uses a unique combination of powerful techniques:

automated grammar extraction,
sophisticated parsing,
automated testing, and
automated grammar transformation.

If one of these ingredients is missing, the synergy is gone. Extraction by hand is error prone, and basic parsing technology limits you to grammars in severely restricted formats. With powerful parsing technology, you can work with arbitrary context-free grammars and test them regardless of their format. Without automated testing, you cannot find many errors quickly. Without tool support for transforming grammar specifications, analyses are inaccurate and corrections are inconsistent; without transformations, you cannot easily repeat what you have done or revise initial decisions. So, to steal grammars, you need to know about grammars, powerful parsing techniques, setting up testing, and automated transformations.
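As an impression of the testing ingredient, the following sketch runs whatever parser is under test over a corpus and tallies failures; parse_file is a stand-in for the generated parser. Because the corpus consists of compilable code, every failure hints at a grammar error:

    import glob

    def test_grammar(parse_file, corpus_glob):
        """Run the parser over a corpus; report files the grammar cannot handle."""
        failures = []
        for path in sorted(glob.glob(corpus_glob)):
            try:
                parse_file(path)
            except Exception as exc:
                failures.append((path, str(exc)))
        print(f"{len(failures)} of the corpus files failed to parse")
        return failures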

 

Stealing from compiler source code


Ericsson uses the extremely complex proprietary language PLEX to program public telephone switches. PLEX consists of about 20 sublanguages, called sectors, including high-level programming sectors, assembly sectors, finite-state-machine sectors, marshaling sectors, and others. We applied our grammar-stealing approach to PLEX as follows:

  1. Reverse-engineer the PLEX compiler (63 Mbytes of source code) on site to look for grammar-related files. We learned that there were BNF files and a hard-coded parser.
  2. Find the majority of the grammars in some BNF dialect.
  3. Find a hand-written proprietary assembly parser with BNF in the comments.
  4. Write six BNF parsers (one for each BNF dialect used).
  5. Extract the plain BNF from the compiler sources and convert it to SDF (Syntax Definition Formalism) for technical reasons.
  6. Find the files containing the lexical analyzer and convert the lexical definitions to SDF.
  7. Combine all the converted grammars into one overall grammar.
  8. Generate an overall parser with a sophisticated parser generator.
  9. Parse the code.

The PLEX grammar was recovered in two weeks, including tool construction, parser generation, and testing against 8 MLOC of PLEX code, at a cost of US$25,000.

To illustrate the limited complexity of the work, consider the fragment of raw compiler source code in Figure 4. A PLEX program consists of a header, a list of statements, the phrase END PROGRAM, and a closing semicolon. The other code in the figure deals with semantic actions relevant to the compiler. Our tools converted this to common BNF while removing the idiosyncratic semantic actions:

plex-program ::= program-header
                 statement-row
                 'END' 'PROGRAM' ';'
Then the tools converted this into SDF, which was subsequently fed to a sophisticated parser generator accepting arbitrary context-free grammars. The output was
Program-header 
Statement-row 
END PROGRAM ; -> Plex-program 
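
The conversion is mechanical, as the following sketch suggests; it mimics the rewrite shown above on our toy grammar table, leaving out the quoting and module structure of real SDF:

    def bnf_rule_to_sdf(lhs, alternatives):
        """Render one BNF rule as SDF-style productions: rhs -> Lhs."""
        sort = lhs.capitalize()              # SDF sort names are capitalized
        productions = []
        for alt in alternatives:
            symbols = [s.strip("'") if s.startswith("'") else s.capitalize()
                       for s in alt]
            productions.append(" ".join(symbols) + " -> " + sort)
        return productions

    print(bnf_rule_to_sdf("plex-program",
        [["program-header", "statement-row", "'END'", "'PROGRAM'", "';'"]]))
    # ['Program-header Statement-row END PROGRAM ; -> Plex-program']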

The tools we built automatically recovered most of the 3,000+ production rules in one afternoon. Then we tested each sector grammar separately. We used a duplicate detector to weed out production rules that appeared in more than one sector grammar, so that we could construct an overall grammar able to parse complete PLEX programs. One assembly-sector parser was hard-coded (see Figure 2), so we had to recover its grammar through reverse engineering; the comments accompanying the code contained reasonably good BNF, so this posed no problem. With all the sector grammars combined, we generated a parser and tested it with an 8-MLOC PLEX test suite. The only files that did not parse were compiler test files that were not supposed to parse; the rest passed the test. In addition, we generated a Web-enabled version of the BNF description as a basis for a complete and correct manual.
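The duplicate detector can be equally small; this sketch assumes each sector grammar is a table as produced by the earlier extraction sketch:

    def duplicate_rules(sector_grammars):
        """Find production rules occurring in more than one sector grammar."""
        seen = {}
        for sector, grammar in sector_grammars.items():
            for lhs, alternatives in grammar.items():
                for alt in alternatives:
                    seen.setdefault((lhs, tuple(alt)), []).append(sector)
        return {rule: sectors for rule, sectors in seen.items()
                if len(sectors) > 1}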

 

Stealing from reference manuals

In another two-week effort, we recovered the VS Cobol II grammar from IBM's manual. The process was straightforward:

  1. Retrieve the online VS Cobol II manual.
  2. Extract its syntax diagrams.
  3. Write a parser for the syntax diagrams.
  4. Extract the BNF from the diagrams.
  5. Add 17 lexical rules by hand.
  6. Correct the BNF using grammar transformations.
  7. Generate an error-detection parser.
  8. Incrementally parse 2 million lines of VS Cobol II code.
  9. Reiterate steps 6 through 8 until all errors vanish.
  10. Convert the BNF to SDF.
  11. Generate a production parser.
  12. Incrementally parse VS Cobol II code to detect ambiguities.
  13. Resolve ambiguities using grammar transformations.
  14. Reiterate steps 11 through 13 until you find no more ambiguities.

So, apart from some cycles to correct errors and remove ambiguities, the process is the same as in the earlier case, where we had access to the compiler source. An error-detection parser detects errors in the grammar from which it is generated. In this case, we used an inefficient top-down parser with infinite lookahead; it accepts practically all context-free grammars and does not bother with ambiguities at all. We use this kind of parser to test the grammar, not to produce parse trees. Because we used only compilable code, every error this parser detects points to a potential grammar problem. In this way, we found all the omissions, given our Cobol testbed. When all our test code passed the top-down parser, we converted the grammar to SDF, generated a parser that detects ambiguities, and corrected those. This project also took two weeks of effort, including tool construction and testing.
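To give an impression of such an error-detection parser, here is a minimal backtracking recognizer over the grammar tables used in the sketches above. Unlike the real tool it is not optimized and cannot handle left-recursive rules, but it shows the unlimited lookahead and the indifference to ambiguity described above:

    def recognize(grammar, start, tokens):
        """Naive top-down recognizer; assumes no left-recursive rules."""
        def match_seq(symbols, pos):
            # Yield every input position reachable after matching symbols.
            if not symbols:
                yield pos
                return
            head, rest = symbols[0], symbols[1:]
            if head in grammar:                      # nonterminal: try all alternatives
                for alt in grammar[head]:
                    for mid in match_seq(alt, pos):
                        yield from match_seq(rest, mid)
            elif pos < len(tokens) and tokens[pos] == head:
                yield from match_seq(rest, pos + 1)  # terminal matched
        return any(end == len(tokens) for end in match_seq([start], 0))

    grammar = {"stmt": [["NEXT", "SENTENCE"], ["CONTINUE"]]}
    print(recognize(grammar, "stmt", ["NEXT", "SENTENCE"]))  # True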

To get an idea of the limited complexity of this technique, consider the syntax diagram shown in Figure 5a, taken from the manual. After conversion to BNF and correction, the diagram looks like the one in Figure 5b.

We used grammar transformations to remove the dash between NEXT and SENTENCE and to replace the two occurrences of imperative-statement with statement-list. The diagram was overly restrictive, allowing only one statement; however, from the manual's informal text we learned, "A series of imperative statements can be specified whenever an imperative statement is allowed." Our error-detection parser found both errors: first, while parsing real code it encountered NEXT SENTENCE, that is, without a dash, and reported an error. After inspecting the manual and the grammar, we wrote a grammar transformation repairing this error. The error-detection parser also revealed that, according to the compiler, more than one statement was correct where the manual insisted on exactly one statement. We repaired this error with a grammar transformation as well.
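Both repairs can be phrased as small, replayable transformations over a grammar table; the rules below are simplified stand-ins for the actual Cobol productions:

    def rename_symbol(grammar, old, new):
        """Replace every occurrence of a symbol on any right-hand side."""
        return {lhs: [[new if sym == old else sym for sym in alt]
                      for alt in alternatives]
                for lhs, alternatives in grammar.items()}

    def split_terminal(grammar, old, parts):
        """Replace one terminal by a sequence, e.g. to drop a stray dash."""
        return {lhs: [sum((list(parts) if sym == old else [sym] for sym in alt), [])
                      for alt in alternatives]
                for lhs, alternatives in grammar.items()}

    g = {"search-stmt": [["'NEXT-SENTENCE'"], ["imperative-statement"]]}
    g = split_terminal(g, "'NEXT-SENTENCE'", ["'NEXT'", "'SENTENCE'"])
    g = rename_symbol(g, "imperative-statement", "statement-list")

Because the transformations are scripts rather than manual edits, they can be replayed whenever the grammar is re-extracted, which is what makes corrections repeatable and consistent.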

Next, in a separate phase, we removed ambiguities. For example, the following fragment of a syntax diagram is present in the Cobol CALL statement:

_____identifier__________________
|__ADDRESS__OF__identifier__| 
|__file-name________________| 

This stack of three alternatives can lead to an ambiguity: both identifier and file-name eventually reduce to the same lexical category. So, when we parsed a CALL statement without an occurrence of ADDRESS OF, the parser reported an ambiguity because the other two alternatives were both valid. Without using type information, we cannot separate identifier from file-name. This is the ambiguous extracted BNF fragment:

(identifier 
| "ADDRESS" "OF" identifier
| file-name) 

With a grammar transformation, we eliminated the file-name alternative, resulting in

(identifier 
| "ADDRESS" "OF" identifier) 

The adapted grammar accepts the same language as before, but the ambiguity is gone. Note that this approach is much simpler than tweaking the parser and scanner to deal with the types of names. In this way, we recovered the entire VS Cobol II grammar and tested it with all our Cobol code from earlier software renovation projects, plus code from colleagues who were curious about the project's outcome. For the final test, we used about two million lines of pure VS Cobol II code. As in the PLEX case, we generated a fully Web-enabled version of both the corrected BNF and the syntax diagrams, which could serve as the core for a complete and correct language reference manual.
