Research / 2019
SPoC: Search-based Pseudocode to Code
In brief
The paper introduces SPoC, a framework and dataset for translating human-written pseudocode into functionally correct C++ programs via search guided by compilation-error–driven credit assignment.
Executive Summary
The authors tackle mapping natural-language pseudocode to longer, functionally correct programs by searching over per-line translations and validating against test cases. They introduce error-localization signals from compilation to guide search and release the SPoC dataset with 18,356 C++ programs annotated with human pseudocode and test cases, achieving 44.7% success under a 100-compilation budget.
Key Technical Advancements
Search over line-wise candidate translations: A seq2seq model generates up to M=100 candidate translations per pseudocode line; best-first search then assembles full programs from these candidates in order of model probability and validates them against compilation and public test cases, rather than relying on the single top translation per line. This reframes synthesis as combinatorial assembly of line candidates.
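A minimal sketch of this best-first enumeration, assuming `candidates[i]` is the model's list of `(code_line, log_prob)` pairs for pseudocode line i, sorted by descending probability; the function names and the neighbor-expansion scheme are illustrative, not the authors' exact implementation:

```python
import heapq

def best_first_programs(candidates, budget=100):
    """Enumerate assembled programs in decreasing total log-probability.

    candidates: list over pseudocode lines; candidates[i] is a list of
    (code_line, log_prob) pairs sorted by descending log_prob.
    Yields up to `budget` (program_text, total_log_prob) pairs.
    """
    n = len(candidates)
    start = tuple([0] * n)  # the top-1 translation for every line

    def cost(idx):
        # Negative total log-probability: smaller cost = more probable program.
        return -sum(candidates[i][j][1] for i, j in enumerate(idx))

    heap = [(cost(start), start)]
    seen = {start}
    emitted = 0
    while heap and emitted < budget:
        c, idx = heapq.heappop(heap)
        program = "\n".join(candidates[i][j][0] for i, j in enumerate(idx))
        yield program, -c
        emitted += 1
        # Expand: move exactly one line to its next-best candidate.
        for i in range(n):
            if idx[i] + 1 < len(candidates[i]):
                nxt = idx[:i] + (idx[i] + 1,) + idx[i + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(heap, (cost(nxt), nxt))
```

In the paper's setting, each assembled program is compiled and checked against the public tests, and the budget counts compilation attempts; the search stops as soon as a candidate passes.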
Compilation-error–driven credit assignment: Since 88.7% of failures are compilation errors, two methods localize the offending portion of a failed program: (1) a multiclass classifier that predicts the erroneous line from the pseudocode, the produced code, and the first compiler message; (2) prefix-based pruning, which finds a minimal failing prefix and safely blacklists it. Both focus the search budget on the lines most likely to need fixing.
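A rough sketch of the prefix-based pruning idea, under the simplifying assumption that a code prefix can be made syntactically complete by closing its open braces; `compiles` is an assumed helper wrapping `g++ -fsyntax-only`, and none of this mirrors the authors' exact procedure:

```python
import os
import subprocess
import tempfile

def compiles(code: str) -> bool:
    """Assumed helper: True if g++ accepts the code (syntax/semantic check only)."""
    with tempfile.NamedTemporaryFile("w", suffix=".cpp", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(["g++", "-fsyntax-only", path],
                                capture_output=True, text=True)
        return result.returncode == 0
    finally:
        os.remove(path)

def minimal_failing_prefix(lines):
    """Return the length k of the shortest prefix that cannot be completed into
    compilable code, approximating completion by closing any open braces."""
    for k in range(1, len(lines) + 1):
        prefix = "\n".join(lines[:k])
        open_braces = prefix.count("{") - prefix.count("}")
        stub = prefix + "\n" + "}" * max(open_braces, 0)
        if not compiles(stub):
            return k  # the search can blacklist any program sharing lines[:k]
    return None  # full program compiles; the failure must be functional
```

Any later candidate program that shares a blacklisted prefix can be skipped without spending another compilation, which is consistent with the observation below that prefix pruning pays off most at larger budgets.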
SPoC dataset (human-authored pseudocode + tests): 18,356 accepted C++ solutions (avg 14.7 lines), each paired with line-granular pseudocode and both public/hidden tests; 59 annotators provided consistent-granularity descriptions. This supports evaluation by functional correctness, not just syntax.
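To make the functional-correctness criterion concrete, here is a hedged sketch of how a candidate C++ program might be judged against SPoC-style (input, expected output) test cases; the compiler flags, timeout, and exact-match comparison are assumptions for illustration, not the dataset's official harness:

```python
import os
import subprocess
import tempfile

def passes_tests(code: str, tests) -> bool:
    """Compile a candidate C++ program and run it on (stdin, expected_stdout) pairs.
    The candidate counts as correct only if every test's output matches."""
    workdir = tempfile.mkdtemp()
    src = os.path.join(workdir, "prog.cpp")
    binary = os.path.join(workdir, "prog")
    with open(src, "w") as f:
        f.write(code)
    build = subprocess.run(["g++", "-O2", "-o", binary, src], capture_output=True)
    if build.returncode != 0:
        return False  # compilation error: the most common failure mode
    for stdin_text, expected in tests:
        try:
            run = subprocess.run([binary], input=stdin_text, capture_output=True,
                                 text=True, timeout=2)
        except subprocess.TimeoutExpired:
            return False
        if run.stdout.strip() != expected.strip():
            return False
    return True
```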
Practical Implications & Use Cases
Technical impact: Relative to using the top translation only, search guided by error localization lifts success to 44.7% at 100 trials; the multiclass model reduces trials in 15.5% of programs (median –26 trials, –42% relative). Prefix pruning helps more at larger budgets.
UX/UI implications: Not explicitly discussed; the work focuses on compiler-signal–guided automation and on benchmarking synthesis success under fixed trial budgets.
Strategic implications: The oracle cap—55.2% (TESTP) and 71.4% (TESTW) given top-100 candidates—highlights that improving candidate generation is crucial; released data enables systematic progress tracking.
Challenges and Limitations
Error-line mismatch: Compiler-reported line numbers are wrong 21.7% of the time, so naïvely down-weighting the reported line can hurt synthesis. Accurate localization is essential.
Candidate coverage and catastrophic cases: Many programs contain at least one line with no correct candidate in the top-100 list (44.8% TESTP; 28.6% TESTW), limiting achievable success; the multiclass method can mislocalize and cause failures on some cases.
Future Outlook & Considerations
The results suggest combining stronger line translators (to raise the oracle ceiling) with robust, verifiable localization (e.g., prefix checks) to guide search under tight budgets. The SPoC release (programs, pseudocode, tests) provides a common ground for evaluating functional correctness and for studying search vs. localization trade-offs on longer programs.
Conclusion
SPoC demonstrates that search plus compilation-aware credit assignment can turn line-level pseudocode into functionally correct programs at scale, outperforming top-one translation and exposing the main bottlenecks in current systems: error localization and candidate coverage. The dataset and methods lay groundwork for future advances in practical, test-driven program synthesis.