Conversation

@JTischbein
Contributor

Follow-up to PR #18012 (comment).

To enable Direct IO model reading by default on Linux and Windows while keeping --mmap as the default on macOS, this PR adds an additional flag for enabling/disabling Direct IO. The flag defaults to true and takes precedence over the mmap parameter: if --direct-io is set and Direct IO is available, --mmap is disabled; if --no-direct-io is set or Direct IO is not available (e.g. on macOS), the specified mmap value is used.
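
In pseudocode, the resolution works roughly like this (a minimal sketch with illustrative names, not the PR's actual variables):

// Sketch of the flag resolution described above; names are illustrative.
bool resolve_use_mmap(bool direct_io_requested, bool direct_io_available, bool mmap_requested) {
    if (direct_io_requested && direct_io_available) {
        // --direct-io (the default) wins: load via Direct IO, mmap is disabled
        return false;
    }
    // --no-direct-io was passed, or the platform (e.g. macOS) lacks Direct IO:
    // fall back to whatever --mmap / --no-mmap asked for
    return mmap_requested;
}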

@ggerganov
Member

I think you need this:

diff --git a/src/llama-model-loader.cpp b/src/llama-model-loader.cpp
index 1355eea95..2db2115a0 100644
--- a/src/llama-model-loader.cpp
+++ b/src/llama-model-loader.cpp
@@ -918,8 +918,7 @@ void llama_model_loader::load_data_for(struct ggml_tensor * cur) const {
         GGML_ASSERT(cur->data != nullptr);
         GGML_ASSERT(w.idx < files.size());
         const auto & file = files.at(w.idx);
-        file->seek(w.offs, SEEK_SET);
-        file->read_raw(cur->data, ggml_nbytes(cur));
+        file->read_raw_at(cur->data, ggml_nbytes(cur), w.offs);
     }
 
     if (check_tensors && !ggml_validate_row_data(cur->type, cur->data, ggml_nbytes(cur))) {

Probably need to assert that llama_file::read_raw is never used with direct io?
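
For example, something along these lines, assuming llama_file keeps track of whether it was opened for direct I/O (a sketch only, not tested):

// hypothetical guard inside llama_file::read_raw(); assumes the object knows
// whether it was opened with O_DIRECT / FILE_FLAG_NO_BUFFERING
void llama_file::read_raw(void * ptr, size_t len) const {
    // direct I/O needs aligned offsets and sizes, so the plain streaming read
    // must never be called on a direct-I/O handle
    GGML_ASSERT(!has_direct_io() && "read_raw must not be used with direct I/O");
    // ... existing buffered implementation ...
}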

@JTischbein
Contributor Author

Thanks for the hint, changed that.

I think an assert would not work, as read_raw is still needed in its current form. Would you suggest renaming read_raw_at to read_raw (using tell() instead of the offset argument)? That way read_raw can be used safely again, and in the loop of load_all_data the current read_raw would be called (as read_raw_direct?).
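
Roughly what I have in mind, as an interface sketch (signatures are illustrative only):

struct llama_file {
    // renamed from read_raw_at: reads at the current position via tell()
    // instead of taking an explicit offset, so it can be used safely anywhere
    void read_raw(void * ptr, size_t len) const;

    // the current read_raw, renamed; only called from the tensor loop in
    // load_all_data
    void read_raw_direct(void * ptr, size_t len) const;
};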

@ggerganov
Member

Would you suggest renaming read_raw_at to read_raw (using tell() instead of the offset argument)? That way read_raw can be used safely again, and in the loop of load_all_data the current read_raw would be called (as read_raw_direct?).

Ok. Would we even need read_raw_direct in this case? If we still need it for some reason, then maybe call it read_raw_unsafe so as not to overload the word "direct" with more meanings.


Also some suggestions that I have not tested, but should at least convey what I mean:

diff --git a/src/llama-model-loader.cpp b/src/llama-model-loader.cpp
index 2db2115a0..ae0c698be 100644
--- a/src/llama-model-loader.cpp
+++ b/src/llama-model-loader.cpp
@@ -508,8 +508,11 @@ llama_model_loader::llama_model_loader(
     files.emplace_back(new llama_file(fname.c_str(), "rb", use_direct_io));
     contexts.emplace_back(ctx);
 
-    // Disable mmap in case Direct I/O is enabled and available
-    if (use_direct_io && files.at(0)->has_direct_io()) {
+    // check if direct io is enabled and supported
+    use_direct_io = use_direct_io && files.back()->has_direct_io();
+
+    if (use_direct_io && use_mmap) {
+        LLAMA_LOG_WARN("%s: direct I/O is enabled, disabling mmap\n", __func__);
         use_mmap = false;
     }
 
@@ -581,6 +584,10 @@ llama_model_loader::llama_model_loader(
             files.emplace_back(new llama_file(fname_split, "rb", use_direct_io));
             contexts.emplace_back(ctx);
 
+            if (use_direct_io && !files.back()->has_direct_io()) {
+                throw std::runtime_error(format("unexpected: direct I/O is not supported for split file %s", fname_split.c_str()));
+            }
+
             // Save tensors data offset info of the shard.
             for (ggml_tensor * cur = ggml_get_first_tensor(ctx); cur; cur = ggml_get_next_tensor(ctx, cur)) {
                 std::string tensor_name = std::string(cur->name);
@@ -722,6 +729,7 @@ llama_model_loader::llama_model_loader(
     }
 
     this->use_mmap = use_mmap;
+    this->use_direct_io = use_direct_io;
     this->check_tensors = check_tensors;
     this->no_alloc = no_alloc;
 }
diff --git a/src/llama-model-loader.h b/src/llama-model-loader.h
index de06b5283..6f15115ce 100644
--- a/src/llama-model-loader.h
+++ b/src/llama-model-loader.h
@@ -70,6 +70,7 @@ struct llama_model_loader {
     size_t   n_bytes    = 0;
 
     bool use_mmap = false;
+    bool use_direct_io = false;
     bool check_tensors;
     bool no_alloc;
 
diff --git a/src/llama-model.cpp b/src/llama-model.cpp
index cf0c39475..502859d2e 100644
--- a/src/llama-model.cpp
+++ b/src/llama-model.cpp
@@ -2337,7 +2337,8 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
 
     const bool use_mmap_buffer = true;
 
-    LLAMA_LOG_INFO("%s: loading model tensors, this can take a while... (mmap = %s)\n", __func__, ml.use_mmap ? "true" : "false");
+    LLAMA_LOG_INFO("%s: loading model tensors, this can take a while... (mmap = %s, direct_io = %s)\n",
+            __func__, ml.use_mmap ? "true" : "false", ml.use_direct_io ? "true" : "false");
 
     // build a list of buffer types for the CPU and GPU devices
     pimpl->cpu_buft_list = make_cpu_buft_list(devices, params.use_extra_bufts, params.no_host);

@askmyteapot

Just an FYI: #18012 broke loading with mmap disabled on Windows.

@JTischbein
Contributor Author

@askmyteapot Thank you for the hint. The issue was that I used off_t, which is a signed 32-bit long on Windows. The fix is in this PR.
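
For illustration (not code from the PR), this is the kind of truncation that happens when a 32-bit off_t meets a tensor offset past 2 GiB:

#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t offs = 3ull * 1024 * 1024 * 1024; // tensor offset at 3 GiB
    const int32_t narrowed = (int32_t) offs;         // what a 32-bit off_t would hold
    std::printf("offset %llu truncated to %d\n",
                (unsigned long long) offs, narrowed); // prints a negative value
    return 0;
}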

@JTischbein
Contributor Author

@ggerganov Should I separate the Windows fix from these changes into a new PR?

@ggerganov
Member

Yes, I would like to take an extra look at the changes here, so better to fix the Windows issue in a separate PR in the meantime. Thanks

@NeoZhangJianyu
Collaborator

What are the parameters to load the model file in this PR?
Here is my understanding, please correct me if it's wrong:
--no-mmap -ndio
--no-mmap -dio
--mmap

When must a user use -ndio?

I think the parameters are a little complex.
Here is my suggestion:
We should keep only --mmap and --no-mmap.
In the case of --no-mmap, the code should detect and smartly switch between direct I/O (dio) and non-direct I/O (ndio) on Windows/Linux/macOS.
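
For example, something like this (just to illustrate the idea, not tested):

// hypothetical selection with only --mmap / --no-mmap exposed to the user
enum class load_mode { MMAP, DIRECT_IO, BUFFERED };

load_mode pick_load_mode(bool use_mmap, bool direct_io_supported) {
    if (use_mmap) {
        return load_mode::MMAP;
    }
    // with --no-mmap, prefer direct I/O where the platform supports it
    // (Linux/Windows) and fall back to buffered reads otherwise (e.g. macOS)
    return direct_io_supported ? load_mode::DIRECT_IO : load_mode::BUFFERED;
}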

@ehoogeveen-medweb

IIUC, the implementation in this PR currently requires passing --no-direct-io in order to enable mmap, and --no-direct-io --no-mmap to disable both. I think it should also recognize that if a user passes just --mmap, they want to disable Direct IO and use mmap (as the two options are mutually exclusive).

In other words, --mmap should imply --no-direct-io (and --direct-io should imply --no-mmap, although that doesn't matter with the current logic). Aside from that case, I think the logic is reasonable assuming that preferring Direct IO over mmap is the way to go.

@JTischbein
Copy link
Contributor Author

I agree with @ehoogeveen-medweb; explicitly specifying --mmap now disables Direct IO. Handling model loading with only --mmap and --no-mmap does not work with the current implementation, as we need separate code paths for loading via mmap, read() and std::fread().

Now there are three ways (roughly sketched after the list):

  • Default (implicitly -dio --mmap): load via Direct IO and, if it is not available, fall back to mmap
  • Explicitly specifying --mmap: load via mmap
  • Explicitly specifying --no-mmap -ndio: load via std::fread()
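
As a simplified sketch of the resulting dispatch (the actual loader code is more involved; read_raw_at is the name from the diff earlier in the thread):

// simplified sketch of the three load paths; not the exact loader code
void load_tensor(llama_file & file, void * dst, size_t len, size_t offs,
                 bool use_mmap, bool use_direct_io) {
    if (use_mmap) {
        // fallback / explicit --mmap: the tensor data comes from the
        // memory mapping, nothing to read here
        return;
    }
    if (use_direct_io) {
        // default on Linux/Windows: unbuffered positioned read
        file.read_raw_at(dst, len, offs);
        return;
    }
    // --no-mmap -ndio: plain buffered read via std::fread()
    file.seek(offs, SEEK_SET);
    file.read_raw(dst, len);
}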
