This Python script converts structured DOCX files into MDX (Markdown + JSX) format, preserving metadata, content structure, and color formatting while ensuring correct spacing, indentation, and color conversions.
This script is specific to NASA VEDA content. Use template/test_LIS.docx as the template and fill in the appropriate information for each section.
- ✅ Extracts metadata, structured tables, and formatted text from DOCX (use file test_LIS.docx for the proper format)
- ✅ Handles a dynamic number of layers (e.g., supporting datasets with varying numbers of layers, not limited to a fixed count).
- ✅ Converts colors between Hex ↔ RGB if needed
- ✅ Builds complete MDX content in memory for efficient single-pass writing.
- ✅ Ensures clean YAML output for frontmatter, preventing common formatting issues.
Currently, this should only be run on a single landing page collection. For example, Land Information System - Alaska in template/test_LIS.docx has four different layers, but is featured in the VEDA data catalog as a single item. On the catalog landing page, all of the information is populated, and clicking Explore Data shows each individual layer, populated from the information you add. The script supports any number of layers, as long as every layer uses the same formatting.
Creating a new conda environment is not required because the script depends on only a few libraries, but one can be created with:
conda env create -f setup/docx2mdx_env.yaml
conda activate docx2mdx
Ensure you have Python >=3.7 installed.
The primary dependencies are python-docx and ruamel.yaml.
Run:
pip install -r setup/requirements.txt
The script uses few external libraries, so a dedicated Conda environment is not strictly necessary, but it is good practice for reproducibility.
Run the script with:
python dump.py /path/to/input.docx rgb_or_hex_string
Example:
python dump.py "template/test_LIS.docx" "rgb"
or
python dump.py "template/test_LIS.docx" "hex"
This automatically:
- Extracts DOCX table and prose information
- Converts it into a structured MDX file
- Saves it into the markdown/ directory
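The two positional arguments in the usage line above could be parsed roughly as follows. This is a minimal sketch, not the code in dump.py: the function name `parse_args` and the argument names `docx_path` and `color_format` are assumptions made for illustration, and the real script may read `sys.argv` directly.

```python
import argparse


def parse_args(argv=None):
    """Parse a DOCX path and a color format ("rgb" or "hex").

    Hypothetical helper: names are illustrative, not taken from dump.py.
    """
    parser = argparse.ArgumentParser(
        description="Convert a structured DOCX file to MDX."
    )
    parser.add_argument("docx_path", help="Path to the input .docx file")
    parser.add_argument(
        "color_format",
        choices=["rgb", "hex"],  # anything else exits with a usage error
        help="Color format to use for legend stops",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    # Demo with explicit arguments instead of the real command line.
    args = parse_args(["template/test_LIS.docx", "rgb"])
    print(args.docx_path, args.color_format)
```

Restricting the second argument with `choices` means a typo such as `rbg` fails immediately with a usage message instead of silently producing unconverted colors.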
Automatically converts colors between Hex and RGB based on user preference.
🔹 Function: color_converter()
def color_converter(color, hex_or_rgb="rgb"):
    """
    Converts Hex ↔ RGB based on user preference.
    """
✅ Converts #FF5733 → (255, 87, 51)
✅ Converts rgb(255, 87, 51) → #FF5733
✅ Keeps format intact if already correct
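The README shows only the signature of color_converter(), so the following is a sketch of how such a conversion could work, not the script's actual implementation. The name `convert_color_sketch` is hypothetical, and the output form `rgb(r,g,b)` without spaces is an assumption based on the legend stops in the example MDX output.

```python
import re

# Accepted input forms (assumed): "#RRGGBB" or "rgb(r, g, b)".
HEX_RE = re.compile(r"^#?([0-9a-fA-F]{6})$")
RGB_RE = re.compile(r"^rgb\(\s*(\d{1,3})\s*,\s*(\d{1,3})\s*,\s*(\d{1,3})\s*\)$")


def convert_color_sketch(color, hex_or_rgb="rgb"):
    """Return `color` as a hex or rgb string, whichever is requested."""
    color = color.strip()
    hex_match = HEX_RE.match(color)
    rgb_match = RGB_RE.match(color)

    # Step 1: normalize either input form to three integer channels.
    if hex_match:
        digits = hex_match.group(1)
        r, g, b = (int(digits[i:i + 2], 16) for i in (0, 2, 4))
    elif rgb_match:
        r, g, b = (int(value) for value in rgb_match.groups())
        if max(r, g, b) > 255:
            raise ValueError(f"Channel out of range in {color!r}")
    else:
        raise ValueError(f"Unrecognized color: {color!r}")

    # Step 2: emit the requested form; input already in that form
    # comes back unchanged apart from spacing and letter case.
    if hex_or_rgb == "hex":
        return "#{:02X}{:02X}{:02X}".format(r, g, b)
    if hex_or_rgb == "rgb":
        return f"rgb({r},{g},{b})"
    raise ValueError("hex_or_rgb must be 'hex' or 'rgb'")


if __name__ == "__main__":
    print(convert_color_sketch("#FF5733", "rgb"))
    print(convert_color_sketch("rgb(255, 87, 51)", "hex"))
```

Normalizing to integer channels first keeps the two directions symmetric and makes the "already in the right format" case fall out without a special branch.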
Extracts table data, metadata, and prose blocks while preserving formatting.
🔹 Function: convert_docx_to_mdx_path()
def convert_docx_to_mdx_path(docx_path):
    """
    Converts a .docx file path to .data.mdx in 'converted_markdown'.
    """
- Creates the markdown/ folder
- Renames .docx → .data.mdx
- Saves the formatted MDX file
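The path handling described above can be sketched with `pathlib`. This is an illustration under assumptions, not the script's own code: `docx_to_mdx_path_sketch` is a hypothetical name, and the output folder is exposed as an `out_dir` parameter defaulting to `markdown` because the docstring and the list above name different folders.

```python
from pathlib import Path


def docx_to_mdx_path_sketch(docx_path, out_dir="markdown"):
    """Map `<name>.docx` to `<out_dir>/<name>.data.mdx`, creating out_dir.

    Hypothetical helper; the default folder name is an assumption.
    """
    src = Path(docx_path)
    if src.suffix.lower() != ".docx":
        raise ValueError(f"Expected a .docx file, got {src.name!r}")

    target_dir = Path(out_dir)
    # Create the output folder if it does not exist yet.
    target_dir.mkdir(parents=True, exist_ok=True)

    # Keep the base name and swap the extension for .data.mdx.
    return target_dir / (src.stem + ".data.mdx")
```

For example, `docx_to_mdx_path_sketch("template/test_LIS.docx")` would return the path `markdown/test_LIS.data.mdx` and create the `markdown` folder as a side effect.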
The script assembles the complete MDX content, including YAML frontmatter and all prose blocks, in memory before writing to the output file. This ensures correct structure and ordering.
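A rough sketch of that in-memory assembly is shown below. It assumes the frontmatter has already been serialized to a YAML string (the script lists ruamel.yaml as a dependency for this) and that each prose block is a Markdown string. The function names `build_mdx_sketch` and `write_mdx` are hypothetical.

```python
def build_mdx_sketch(frontmatter_yaml, prose_blocks):
    """Join YAML frontmatter and prose blocks into one MDX string."""
    # Frontmatter is fenced by '---' lines at the top of the file.
    parts = ["---", frontmatter_yaml.strip("\n"), "---"]

    # Each prose body is wrapped in <Block><Prose>...</Prose></Block>,
    # in the order the blocks were supplied.
    for body in prose_blocks:
        parts.extend(
            ["<Block>", "<Prose>", body.strip("\n"), "</Prose>", "</Block>"]
        )
    return "\n".join(parts) + "\n"


def write_mdx(path, content):
    """Write the fully assembled content in a single pass."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(content)


if __name__ == "__main__":
    demo = build_mdx_sketch(
        "id: demo\nname: Demo\n",
        ["**Data Type:** Research"],
    )
    print(demo)
```

Building the whole string first means that a failure partway through conversion leaves no half-written file behind, and the ordering of frontmatter and blocks is decided in one place.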
Your final MDX file will look like this:
---
id: lis-alaska-nrt
name: Land Information System - Alaska
description: State of Alaska vegetation and hydrological information produced by NASA’s
  Short-term Prediction and Transition Center – Land Information System (SPoRT-LIS).
layers:
  - id: alaska_relative_soil_moisture_10cm
    stacCol: lis_ak_rsm_10cm
    stacApiEndpoint: https://dev.openveda.cloud/api/stac
    name: Relative Soil Moisture (0-10cm), Updated Daily
    type: raster
    description: Relative soil moisture (RSM) is a ratio of the volumetric soil moisture
      between the wilting and saturation points for a given soil type.
    legend:
      unit:
        label: Percentage %
      type: gradient
      min: 0
      max: 100
      stops:
        - rgb(60,40,180)
        - rgb(111,96,219)
        - rgb(160,139,255)
        - rgb(149,209,251)
---
<Block>
<Prose>
**Temporal Extent:** 6 days prior - Present<br />
**Temporal Resolution:** Daily<br />
**Spatial Extent:** Alaska<br />
**Spatial Resolution:** 0.03° x 0.03°<br />
**Data Type:** Research<br />
**Data Latency:** Updated Daily
</Prose>
</Block>
<Block>
<Prose>
## Source Data Product Citation
Kumar, S.V., C.D. Peters-Lidard, Y. Tian, P.R. Houser, J. Geiger, S. Olden, L. Lighty, J.L. Eastman, B. Doty, P. Dirmeyer, J. Adams, K. Mitchell, E. F. Wood, and J. Sheffield.
</Prose>
</Block>
This project is open-source under the MIT License.