CodeFrame - Multi-Language Code Parser

A Tree-sitter-based code parser that extracts structural information from source files across multiple programming languages.

Supported Languages

Java (.java)
JavaScript (.js)
TypeScript (.ts)
Python (.py)
C# (.cs)
PHP (.php)

Features

For each supported language, CodeFrame extracts:

Type Information
- Class/Interface declarations
- Base classes (extends)
- Implemented interfaces
Method Information
- Method/Function names
- Parameters
- Local variables
- Method calls with object context

Usage

Build the project

./gradlew build

Run analysis (two arguments required)

CodeFrame requires two arguments: <input-path> and <output-file>.

# Gradle
./gradlew run --args="<input-path> <output-file>"

# Direct JAR
java -jar codeframe.jar <input-path> <output-file>

Examples:

# Analyze a single file, write to codeframe-out/analysis.jsonl
./gradlew run --args="src/main/java/org/example/MyClass.java codeframe-out/analysis.jsonl"

# Analyze an entire directory
./gradlew run --args="src/main/java codeframe-out/analysis.jsonl"

# Analyze the entire project
./gradlew run --args=". codeframe-out/analysis.jsonl"

# Run directly via java
java -jar codeframe.jar src codeframe-out/analysis.jsonl

Docker workflow

Use separate folders in the container:

/workspace: the CodeFrame project (bind-mounted to your repo)
/src: the codebase to analyze (mounted read-only)
Results are written under /workspace/.out (persisted on your host via the /workspace bind mount; .out/ is gitignored)

1) Build the image

docker build -t codeframe-dev .

2) Run the container with volumes

Windows (PowerShell):

docker run --rm -it `
  -v "$PWD:/workspace" `
  -v "C:\data\repos\my-project\src:/src:ro" `
  -w /workspace `
  codeframe-dev

Linux/macOS:

docker run --rm -it \
  -v "$PWD:/workspace" \
  -v "/absolute/path/to/your/repo:/src:ro" \
  -w /workspace \
  codeframe-dev

Optional debug port:

docker run --rm -it -p 5005:5005 \
  -e "JAVA_TOOL_OPTIONS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005" \
  -v "$PWD:/workspace" \
  -v "/absolute/path/to/your/repo:/src:ro" \
  -w /workspace \
  codeframe-dev

3) Run the program inside the container

./gradlew clean run --args="/src /workspace/.out/analysis.jsonl"

Output

The analysis results are written to the path you pass as the second argument (e.g., /workspace/.out/analysis.jsonl) in JSONL format (JSON Lines - one JSON object per line). Parent directories for the output file are created automatically, and .out/ is gitignored by default.
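As an illustration of the behaviour described above, here is a minimal sketch of writing JSONL to a path whose parent directories may not exist yet. The class and method names (JsonlWriterSketch, writeLines) are hypothetical and are not CodeFrame's actual API; the sketch assumes the records have already been serialized to JSON strings.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper for illustration only; not part of CodeFrame.
final class JsonlWriterSketch {

    // Writes each pre-serialized JSON object on its own line,
    // creating missing parent directories first.
    static void writeLines(Path output, List<String> jsonObjects) throws IOException {
        Path parent = output.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent); // does nothing if the directories already exist
        }
        try (BufferedWriter writer = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (String json : jsonObjects) {
                writer.write(json);
                writer.write('\n'); // JSON Lines: exactly one object per line
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("codeframe-sketch").resolve(".out/analysis.jsonl");
        writeLines(out, List.of(
                "{\"kind\":\"run\",\"input_path\":\"src\"}",
                "{\"kind\":\"done\",\"files_analyzed\":0}"));
        System.out.println("Wrote " + Files.readAllLines(out).size() + " lines to " + out);
    }
}
```

Writing one record at a time means the writer never has to hold the full result set in memory.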

Ignore patterns (.ignore)

Location: project root .ignore (included in releases).

Default contents:

**node_modules**
**.git**
**.Designer.cs**
**.Designer.vb**

Syntax:
- Blank lines and lines starting with # are ignored.
- Globs supported: * (within a segment), ** (across segments).
- Paths are matched against normalized project paths relative to the input root.
Examples:
- **node_modules** → ignore anything under any node_modules folder.
- **.Designer.cs → ignore files ending with .Designer.cs anywhere.
- src/generated/** → ignore everything under src/generated/.

How it works:

CodeFrame loads .ignore at startup using dx-ignore and filters files before analysis.
If .ignore is missing, no files are excluded by ignore rules.
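
To make the glob rules above concrete, the following sketch translates a pattern into a regular expression, with * confined to a single path segment and ** allowed to cross segments. It is an assumption-laden illustration, not the dx-ignore implementation: the names IgnoreGlobSketch, toRegex, and isIgnored are hypothetical, and the real library may treat edge cases differently.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration of the glob semantics; not the dx-ignore implementation.
final class IgnoreGlobSketch {

    // Translates a glob into a regex: "**" crosses path segments, "*" stays inside one.
    static Pattern toRegex(String glob) {
        StringBuilder regex = new StringBuilder();
        int i = 0;
        while (i < glob.length()) {
            char c = glob.charAt(i);
            if (c == '*' && i + 1 < glob.length() && glob.charAt(i + 1) == '*') {
                regex.append(".*");      // any characters, including '/'
                i += 2;
            } else if (c == '*') {
                regex.append("[^/]*");   // any characters except '/'
                i++;
            } else {
                regex.append(Pattern.quote(String.valueOf(c))); // literal character
                i++;
            }
        }
        return Pattern.compile(regex.toString());
    }

    // Returns true if the path (relative to the input root) matches any rule.
    // Blank lines and lines starting with '#' are skipped.
    static boolean isIgnored(String relativePath, List<String> ignoreLines) {
        String normalized = relativePath.replace('\\', '/');
        for (String line : ignoreLines) {
            String rule = line.trim();
            if (rule.isEmpty() || rule.startsWith("#")) {
                continue;
            }
            if (toRegex(rule).matcher(normalized).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> rules = List.of("# generated code", "**node_modules**", "src/generated/**");
        System.out.println(isIgnored("web/node_modules/lib/index.js", rules));
        System.out.println(isIgnored("src/main/App.java", rules));
    }
}
```

Under these assumptions the first path is ignored and the second is kept.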

Why JSONL?

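Memory efficient: Results are written one record at a time, so memory use does not grow with codebase size
Streamable: Consumers can process results line by line without loading the entire file
Resumable: Records already written stay valid if a run is interrupted, so analysis can be stopped and restarted without losing progress
Parallel-friendly: Each record is an independent line, so multiple threads can write safely as long as each line is written whole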

Output Structure

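Each line is a separate JSON object. Run, error, and completion records carry a top-level kind field; in the sample below, file analysis records do not, and kind appears only inside their types entries.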

Line 1 - Run metadata:

{"kind":"run","started_at":"2025-09-30T11:00:00Z","input_path":"src","total_files":1000}

Lines 2-N - File analyses:

{"filePath":"src/Example.java","language":"java","packageName":"com.example","types":[{"kind":"class","name":"Example","visibility":"public","modifiers":["public"],"annotations":["@Component"],"extendsType":"BaseClass","implementsInterfaces":["Interface1"]}],"fields":[{"name":"service","type":"MyService","visibility":"private","modifiers":["private","final"],"annotations":["@Autowired"]}],"methods":[{"name":"processData","returnType":"Result","visibility":"public","modifiers":["public"],"annotations":["@Override"],"parameters":[{"name":"input","type":"String"}],"localVariables":["result"],"methodCalls":[{"methodName":"validate","objectType":"String","objectName":"input","callCount":1}]}],"imports":["import com.example.MyService;"]}

Error records (if any):

{"kind":"error","file":"src/Bad.java","language":"java","error":"Parse error"}

Last line - Completion metadata:

{"kind":"done","ended_at":"2025-09-30T11:00:05Z","files_analyzed":998,"files_with_errors":2,"duration_seconds":5}
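
A consumer can stream the file line by line, as in the sketch below, which tallies records by kind. It relies on an assumption taken from the samples above: run, error, and done records begin with the kind key, and any other line is a file analysis. The class name JsonlKindCounter and the label "file" are hypothetical. A real consumer would normally use a JSON parser rather than prefix matching, since file analysis lines also contain nested kind fields.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical consumer for illustration; prefix matching stands in for a JSON parser.
final class JsonlKindCounter {

    private static final String KIND_PREFIX = "{\"kind\":\"";

    // Assumes metadata and error records start with the "kind" key, as in the samples.
    // Anything else is counted under the made-up label "file".
    static String kindOf(String line) {
        if (line.startsWith(KIND_PREFIX)) {
            int end = line.indexOf('"', KIND_PREFIX.length());
            if (end > 0) {
                return line.substring(KIND_PREFIX.length(), end);
            }
        }
        return "file";
    }

    // Reads one line at a time, so memory use stays flat for large result files.
    static Map<String, Integer> countKinds(BufferedReader reader) throws IOException {
        Map<String, Integer> counts = new TreeMap<>();
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.isBlank()) {
                continue;
            }
            counts.merge(kindOf(line), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        String sample = String.join("\n",
                "{\"kind\":\"run\",\"input_path\":\"src\",\"total_files\":2}",
                "{\"filePath\":\"src/Example.java\",\"types\":[{\"kind\":\"class\"}]}",
                "{\"kind\":\"error\",\"file\":\"src/Bad.java\",\"error\":\"Parse error\"}",
                "{\"kind\":\"done\",\"files_analyzed\":1,\"files_with_errors\":1}");
        System.out.println(countKinds(new BufferedReader(new StringReader(sample))));
    }
}
```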

Architecture

Core Components

Language - Enum defining supported languages
LanguageDetector - Detects language from file extension
LanguageAnalyzer - Interface for language-specific analyzers
FileAnalysis - Model containing analysis results
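
As a rough picture of how extension-based detection can work, the sketch below maps the file extensions listed under Supported Languages to an enum. It is a simplified stand-in, not the project's actual Language or LanguageDetector code: the nested enum, its extension field, and the detect method are assumptions made for illustration.

```java
import java.util.Optional;

// Hypothetical stand-in for illustration; the real Language and LanguageDetector may differ.
final class LanguageDetectorSketch {

    enum Language {
        JAVA(".java"), JAVASCRIPT(".js"), TYPESCRIPT(".ts"),
        PYTHON(".py"), CSHARP(".cs"), PHP(".php");

        final String extension;

        Language(String extension) {
            this.extension = extension;
        }
    }

    // Returns the language whose extension the file name ends with, if any.
    static Optional<Language> detect(String fileName) {
        for (Language language : Language.values()) {
            if (fileName.endsWith(language.extension)) {
                return Optional.of(language);
            }
        }
        return Optional.empty(); // unsupported files are skipped by the caller
    }

    public static void main(String[] args) {
        System.out.println(detect("src/Example.java"));
        System.out.println(detect("README.md"));
    }
}
```

Returning Optional keeps unsupported files out of the analyzer lookup without resorting to null checks.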

Language Analyzers

Each language has a dedicated analyzer:

JavaAnalyzer - Parses Java classes, interfaces, methods
TypeScriptAnalyzer - Parses TypeScript classes, interfaces, functions
JavaScriptAnalyzer - Parses JavaScript classes and functions
PythonAnalyzer - Parses Python classes and functions
CSharpAnalyzer - Parses C# classes, interfaces, methods
PHPAnalyzer - Parses PHP classes, interfaces, functions

Tree-sitter Integration

The project uses Tree-sitter grammar libraries:

tree-sitter-java
tree-sitter-javascript
tree-sitter-typescript
tree-sitter-python
tree-sitter-c-sharp
tree-sitter-php

Architectural Decisions

1) Choosing Tree-sitter

Incremental, robust parsing: Tree-sitter provides concrete syntax trees with stable node types across languages, suitable for structural extraction (types, methods, fields, calls).
Multi-language, consistent API: A single parsing approach across Java, JS/TS, Python, C#, PHP simplifies analyzer design and maintenance.
Performance and memory: Fast parsing with small memory footprint; aligns with our streaming JSONL output to keep RAM low for large repos.
Runtime constraints: In constrained runners/containers, we need deterministic, offline-friendly tooling. Tree-sitter grammars are shipped as Maven artifacts, avoiding runtime downloads or external CLIs.

2) Java binding: tree-sitter bonede

Bundled native libraries: The io.github.bonede:tree-sitter artifacts include native binaries for Windows/Linux/macOS. This removes the need for a local C toolchain or building native libs during CI/runtime.
Cross-OS compatibility: Works the same on developer machines, Docker (Linux), and Windows hosts—critical for heterogeneous environments.
Runtime constraints: In sandboxed environments we cannot install system packages or compile natives. Bonede’s prebuilt natives make the analyzer portable and ready-to-run without extra steps.

Extending

Adding a New Language

Add the Tree-sitter grammar dependency to build.gradle
Add the language to the Language enum
Update LanguageDetector with file extension mapping
Create a new analyzer implementing LanguageAnalyzer
Register the language and analyzer in App.java

Example: Adding Go Support

// 1. Add to Language enum
GO("go")

// 2. Update LanguageDetector
if (fileName.endsWith(".go")) {
    return Optional.of(Language.GO);
}

// 3. Create GoAnalyzer.java
public class GoAnalyzer implements LanguageAnalyzer {
    @Override
    public FileAnalysis analyze(String filePath, String sourceCode, TSNode rootNode) {
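        // Walk rootNode and build the FileAnalysis here; fail loudly until implemented
        throw new UnsupportedOperationException("GoAnalyzer is not implemented yet");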
    }
}

// 4. Register in App.java
TREE_SITTER_LANGUAGES.put(Language.GO, new TreeSitterGo());
ANALYZERS.put(Language.GO, new GoAnalyzer());

Requirements

Java 11+
Gradle 8.x
No native toolchain required (Tree-sitter natives are bundled via Maven artifacts)

License

This project uses Tree-sitter and its language grammars, which are licensed under MIT.

Limitations

All languages

Top-level fields/constants (for languages that support them, e.g., JavaScript, TypeScript, Python, PHP) are not emitted as entries in the analysis output. The analyzer focuses on types (classes/interfaces/enums/records where applicable) and functions/methods.

JavaScript

Destructured parameter extraction is leaf-only. For a signature like fn({ data: { user, settings }, meta: { timestamp } }), parameters emitted are user, settings, timestamp (not data, meta).
Generator functions are marked using syntax-like modifiers:
- Top-level functions: "function*" (e.g., export function* name())
- Class methods: "*" (e.g., *methodName())
Dynamic import expressions import("path") are not modeled as method calls and are currently ignored in methodCalls.
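
The leaf-only rule for destructured parameters can be pictured with a toy model. In the sketch below a destructuring pattern is a map from property name to either a nested pattern or null, where null marks a property that binds a variable directly. This representation and the names LeafParamsSketch and leafNames are assumptions for illustration; the actual analyzer works on Tree-sitter nodes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model for illustration; the real analyzer walks Tree-sitter nodes instead of maps.
final class LeafParamsSketch {

    // Collects only the names that bind a variable, in source order.
    static List<String> leafNames(Map<String, Object> pattern) {
        List<String> names = new ArrayList<>();
        collect(pattern, names);
        return names;
    }

    private static void collect(Map<?, ?> pattern, List<String> names) {
        for (Map.Entry<?, ?> entry : pattern.entrySet()) {
            Object value = entry.getValue();
            if (value instanceof Map) {
                collect((Map<?, ?>) value, names); // descend; the container name is skipped
            } else {
                names.add(String.valueOf(entry.getKey())); // leaf binding
            }
        }
    }

    public static void main(String[] args) {
        // Models fn({ data: { user, settings }, meta: { timestamp } })
        Map<String, Object> data = new LinkedHashMap<>();
        data.put("user", null);
        data.put("settings", null);
        Map<String, Object> meta = new LinkedHashMap<>();
        meta.put("timestamp", null);
        Map<String, Object> parameter = new LinkedHashMap<>();
        parameter.put("data", data);
        parameter.put("meta", meta);

        System.out.println(leafNames(parameter)); // prints [user, settings, timestamp]
    }
}
```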

C#

Called constructors and fields are not captured
- Current call extraction focuses on method invocations and property accessors. Constructor calls (e.g., new Type(...) and base(...)/this(...)) and direct field reads/writes are not emitted in methodCalls.
Loop local variables are not captured
- Variables declared in loop headers (e.g., for (var i = 0; ...), foreach (var x in ...)) are not added to localVariables.
- See src/test/resources/samples/csharp/LoopLocalsSample.cs for examples.
Events are not handled
- Event declarations/subscriptions/raises are not modeled.
- See src/test/resources/samples/csharp/DelegatesEventsLambdasSample.cs

Java

Constructor calls are not captured
- Constructor invocations (e.g., new ClassName(...)) are not emitted in methodCalls.
- See src/test/resources/samples/java/MultipleClasses.java for an example (new ExtraClass()).
Loop header locals are not captured
- Variables declared in loop headers (e.g., for (int i = 0; ...)) are not added to localVariables.
- See src/test/resources/samples/java/MultipleClasses.java for an example (for (int i = 0; i < times; i++)).
Local and anonymous classes are not extracted as separate types
- Bodies are analyzed within the enclosing method or type, and their method calls are recorded.
- The classes themselves do not appear as distinct types entries.
- See src/test/resources/samples/java/AnonymousInnerClassesSample.java.

Testing

ApprovalTests-based strategy
- We use ApprovalTests-Java to snapshot analysis results. Each test verifies the pretty-printed JSON using an approved artifact.
- When output changes, a .received.txt is generated next to the test class; review and promote it to .approved.txt if correct.
Running tests
- All tests: ./gradlew test
- Single test method, e.g. Java generics: ./gradlew test --tests "*JavaAnalyzeApprovalTest.analyze_Java_GenericsSample"
Workflow
- Make a change → run tests → inspect .received.txt → approve if expected → commit both code and updated .approved.txt.
