fix: stop flow inference from hanging on large self-assignments #1116

Open
lewis6991 wants to merge 1 commit into EmmyLuaLs:main from lewis6991:flow-budget-fallback

@lewis6991

Collaborator

Problem

  • Repeated self-dependent assignments can make flow inference spend too long in a single query.
  • The issue is not limited to concat; similar repeated arithmetic or comparison assignments can hit the same path.
  • When this happens, semantic model construction can stall.

Solution

  • Add a per-flow-query step budget to the iterative flow scheduler.
  • When the budget is exceeded, log a warning, clear in-progress flow cache guards, cache unknown for active/pending queries, and keep analysis moving.
  • Add an internal development/test option to disable the fallback when a runaway query needs to be reproduced.
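The budget-and-fallback idea in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Ty`, `deps_of`, `run_query`, and the dependency-map model are hypothetical, the budget value is the 50_000 figure mentioned in the review, and clearing the in-progress cache guards is not modeled.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the analyzer's inferred type.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Ty {
    Unknown,
    Str,
}

/// Assumed value: 50_000 is the figure mentioned in the review of this PR.
const FLOW_INFERENCE_STEP_BUDGET: usize = 50_000;

/// Returns the queries that `id` depends on (empty if it has none).
fn deps_of(deps: &HashMap<u32, Vec<u32>>, id: u32) -> &[u32] {
    deps.get(&id).map(Vec::as_slice).unwrap_or(&[])
}

/// Resolves `root` with an explicit stack instead of recursion.
/// `budget: None` models the development/test switch that disables the fallback.
fn run_query(root: u32, deps: &HashMap<u32, Vec<u32>>, budget: Option<usize>) -> HashMap<u32, Ty> {
    let mut cache: HashMap<u32, Ty> = HashMap::new();
    let mut stack = vec![root];
    let mut steps = 0usize;

    while let Some(&id) = stack.last() {
        steps += 1;
        if budget.is_some_and(|limit| steps > limit) {
            eprintln!("warning: flow query {root} exceeded its step budget, using unknown");
            // Cache unknown for every active/pending query that has no result yet.
            for pending in stack.drain(..) {
                cache.entry(pending).or_insert(Ty::Unknown);
            }
            break;
        }

        let unresolved: Vec<u32> = deps_of(deps, id)
            .iter()
            .copied()
            .filter(|dep| !cache.contains_key(dep))
            .collect();

        if unresolved.is_empty() {
            // All dependencies are cached; unknown propagates to dependents.
            let any_unknown = deps_of(deps, id).iter().any(|dep| cache[dep] == Ty::Unknown);
            cache.insert(id, if any_unknown { Ty::Unknown } else { Ty::Str });
            stack.pop();
        } else {
            // Resolve the dependencies first, then revisit `id`.
            stack.extend(unresolved);
        }
    }
    cache
}
```

In this toy model a self-dependent query (`1` depending on `1`) never finishes without a budget, while a budgeted run returns `Ty::Unknown` for it and lets the caller continue. Ordinary acyclic chains finish well under the limit and keep their precise type.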

Tests

  • cargo test -p emmylua_code_analysis test_issue_1114_repeated_self_dependent_assignments_build_semantic_model -- --nocapture
  • cargo test -p emmylua_code_analysis flow
  • cargo fmt --all --check
  • git -c core.fsmonitor=false diff --check

Fixes #1114

Log a warning and return unknown when a single flow query exceeds the step
budget. This keeps semantic model construction moving for pathological repeated
assignments while preserving normal flow precision.

Add a development/test switch that can turn the fallback off when a runaway
flow query needs to be reproduced.

Fixes EmmyLuaLs#1114
Assisted-by: Codex

@github-actions github-actions Bot left a comment

Here's my code review of the changes:

Issues Found:

1. Potential Logic Error: cache_options field not used in get_infer_cache

  • File: infer_cache_manager.rs
  • Issue: The cache_options field is stored but only analysis_phase is being modified. Other fields in CacheOptions (like disable_flow_inference_step_budget) are being overridden/ignored.
  • Suggestion: Either use the full cache_options or document why only analysis_phase needs to be overridden.

2. Missing Clone implementation for CacheOptions

  • File: cache_options.rs
  • Issue: CacheOptions is Clone + Copy but LuaAnalysisPhase is only Clone + Copy. This is fine, but ensure all consumers expect this behavior.
  • Suggestion: Verify that no code depends on CacheOptions being mutable after creation.

3. Potential Performance Issue: Cloning cache_options on every cache access

  • File: infer_cache_manager.rs line 22
  • Issue: let mut cache_options = self.cache_options; creates a copy on every cache lookup, which is unnecessary since it's Copy.
  • Suggestion: Use self.cache_options directly or reference it without copying.

4. Inconsistent Budget Handling

  • File: get_type_at_flow.rs
  • Issue: The budget check in evaluate_walk returns ContinueWalk without consuming a step, while start_query consumes a step. This could lead to inconsistent budget tracking.
  • Suggestion: Ensure step consumption is consistent across all code paths.

5. Missing Error Handling for Budget Exhaustion

  • File: get_type_at_flow.rs
  • Issue: When budget is exhausted, the function returns LuaType::Unknown silently (except for a warning log). This could mask real issues.
  • Suggestion: Consider returning a more specific error type or adding a metric/counter for budget exhaustion events.

6. Test Coverage Gap

  • File: flow.rs
  • Issue: The test only checks if get_semantic_model returns Some, not the actual type inference results.
  • Suggestion: Add assertions to verify the inferred types are correct (e.g., string, number, boolean).

Recommendations:

  1. Document the budget constant - Add a comment explaining why 50_000 was chosen and how to tune it.

  2. Consider making budget configurable - The disable_flow_inference_step_budget flag is good for debugging, but consider making the budget value configurable too.

  3. Add budget exhaustion metrics - Track how often the budget is exceeded to help tune the constant.

  4. Review thread safety - Ensure CacheOptions being Copy doesn't cause issues in concurrent scenarios.

Overall, the changes look well-structured and address the performance issue with flow inference. The budget mechanism is a good addition to prevent infinite loops or excessive computation.
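Items 1 and 3 of the review both turn on what happens when a `Copy` struct is copied into a local and one field is overridden. The sketch below shows that behavior under stated assumptions: the field names come from the review text, while the `LuaAnalysisPhase` variants, the `InferCacheManager` struct, and the `options_for` method are hypothetical and may not match the real code.

```rust
/// Hypothetical variants; the real enum may differ.
#[derive(Debug, Clone, Copy, PartialEq)]
enum LuaAnalysisPhase {
    Ordered,
    Unordered,
}

/// Field names follow the review text; the real struct may have more fields.
#[derive(Debug, Clone, Copy, PartialEq)]
struct CacheOptions {
    analysis_phase: LuaAnalysisPhase,
    disable_flow_inference_step_budget: bool,
}

/// Hypothetical owner of the stored options.
struct InferCacheManager {
    cache_options: CacheOptions,
}

impl InferCacheManager {
    /// Copies the stored options and overrides only the phase.
    fn options_for(&self, phase: LuaAnalysisPhase) -> CacheOptions {
        // `CacheOptions` is `Copy`, so this is a plain bitwise copy of a small struct.
        let mut cache_options = self.cache_options;
        cache_options.analysis_phase = phase;
        // Every other field, including the budget switch, is carried over unchanged.
        cache_options
    }
}
```

If the real code has this shape, the copy is cheap, the stored options are not mutated, and `disable_flow_inference_step_budget` survives the override. Whether that matches `infer_cache_manager.rs` would need to be checked against the diff.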

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a step budget (FLOW_INFERENCE_STEP_BUDGET) for flow inference queries to prevent semantic model construction from stalling on complex or deeply nested self-dependent assignments. It propagates CacheOptions through the analysis pipeline to allow disabling this budget during testing, and adds a regression test for issue 1114. I have no feedback to provide as there are no review comments to address.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

@lewis6991

Collaborator Author

This is a more general mitigation for issues like #1114: it loses precision but avoids flow inference getting stuck. #1115 is a more specific solution for #1114 that doesn't lose precision.

I'm not 100% sure about this. It improves user experience, but will mask issues in the flow engine. I added a flag so we can at least disable this for tests.

@tzssangglass

I tested this PR locally, and it works for me.

@lewis6991

Collaborator Author

Did you try #1115? That is a real fix.

@tzssangglass

tzssangglass commented Jun 20, 2026

  1. Build EmmyLuaLS from #1115 (fix: stop repeated assignments from hanging)
  2. Delete the .emmy folder under the large project
  3. Start a large project

I observed that the CPU usage of the emmylua_ls process rose above 100%. After a long time, the CPU remained at 100%, exactly the same as before.

I don't think #1115 resolves the issue.

@lewis6991

Collaborator Author

It's not just about the project being large. There are specific forms of code that can push analysis to work hard. #1115 fixes that for a specific case. This PR just tells analysis to give up after a while.

Are you able to provide a test case that doesn't work with #1115?

@tzssangglass

Observation: After startup, the workspace analysis enters the "flow analyze" stage. The single-threaded CPU usage remains at 100% for over 5 minutes without decreasing. This is a pure CPU calculation (without file I/O).

GDB backtrace (LWP 3795686, the thread consuming CPU resources):

  #0  mi_theap_malloc_zero_aligned_at_overalloc
  #1  hashbrown::raw::RawTableInner::fallible_with_capacity
  #2  hashbrown::raw::RawTable<T,A>::reserve_rehash
  #3  hashbrown::map::HashMap<K,V,S,A>::insert
  #4  emmylua_code_analysis::semantic::infer::infer_expr
  #5  emmylua_code_analysis::semantic::cache::LuaInferCache::with_replay_overlay
  #6  emmylua_code_analysis::semantic::infer::narrow::get_type_at_flow::FlowReplayQuery::replay_type
  #7  emmylua_code_analysis::semantic::infer::narrow::get_type_at_flow::FlowTypeEngine::start_expr_replay
  #8  emmylua_code_analysis::semantic::infer::narrow::infer_expr_narrow_type
  #9  emmylua_code_analysis::semantic::infer::infer_name::infer_var_ref_type
  #10 emmylua_code_analysis::semantic::infer::infer_name::infer_name_expr
  #11 emmylua_code_analysis::semantic::infer::infer_expr
  #12 emmylua_code_analysis::semantic::infer::infer_index::try_infer_expr_for_index
  #13 emmylua_code_analysis::semantic::infer::infer_index::infer_index_expr
  #14 emmylua_code_analysis::semantic::infer::infer_expr
  #15 emmylua_code_analysis::semantic::infer::infer_index::try_infer_expr_for_index
  #16 emmylua_code_analysis::semantic::infer::infer_index::infer_index_expr
  #17 emmylua_code_analysis::semantic::infer::infer_expr
  #18 emmylua_code_analysis::semantic::infer::infer_expr
  #19 emmylua_code_analysis::semantic::infer::infer_expr_list_types
  #20 emmylua_code_analysis::semantic::overload_resolve::resolve_signature
  #21 emmylua_code_analysis::semantic::infer::infer_call::infer_call_expr_func
  #22 emmylua_code_analysis::semantic::infer::infer_expr
  #23 emmylua_code_analysis::compilation::analyzer::lua::stats::analyze_local_stat
  #24 <LuaAnalysisPipeline as AnalysisPipeline>::analyze
  #25 emmylua_code_analysis::compilation::analyzer::analyze
  #26 emmylua_code_analysis::compilation::LuaCompilation::update_index
  #27 emmylua_code_analysis::EmmyLuaAnalysis::update_files_by_uri
  #28 emmylua_code_analysis::EmmyLuaAnalysis::reload_workspace_files
  #29 emmylua_ls::handlers::initialized::initialized_handler::{{closure}}

strace (3-second sample, confirming the thread is purely CPU-bound with no file I/O):

  % time     seconds  usecs/call     calls    errors syscall
    0.00    0.000000           0        78           write
  100.00    0.000000           0        78           total

log (last few lines; analysis made no further progress after reaching the "flow analyze" stage):

  load files from workspace count: 9118
  update files: cost 2.480971976s
  module analyze: cost 19.9563ms
  decl analyze: cost 2.524074206s
  doc analyze: cost 436.86232ms
  flow analyze: cost 628.259041ms

My AI's speculation on this (sorry, I don't fully understand the source code, so the AI's speculation may be misleading):

  Call stack by frame number, top to bottom:
  - #4 infer_expr
  - #5 try_infer_expr_for_index
  - #6 infer_index_expr
  - #7 infer_expr
  - #8 try_infer_expr_for_index
  - #9 infer_index_expr
  - #10 infer_expr
  - #12 instantiate_func_generic::infer_call_arg_type
  - #13 instantiate_func_generic
  - #14 resolve_signature

  Interpretations I added (not verified):

  - "mutual recursion": Because infer_expr and infer_index_expr alternate in the stack, I inferred they are mutually recursive. However, a single gdb bt is a snapshot at one instant; it cannot 100% prove it's an infinite loop (it could be a very deep but finite call, or just happened to stop at this frame). To confirm it's a loop, you'd need multiple consecutive bt snapshots to see if the stack keeps growing or stays constant.
  - "on chained index a.b.c...": A guess. infer_index might be analyzing something like a.b.c, but it could also be some other Lua construct triggering index inference. No evidence.
  - "triggered during generic function instantiation": #12-#14 do contain instantiate_func_generic/resolve_signature in their names, so this is relatively reliable (the function names say so). But whether it's the trigger or just a frame on the recursion path, I cannot determine.
  - "NOT the self_dependent_assignment path that #1115 fixes": Inference. #1115 modifies self_dependent_assignment_operator_type, and this stack doesn't show that function name, so I said "not the same path." But that's only "this frame doesn't show it"; it doesn't prove the triggering logic is entirely unrelated.
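The suggestion to compare several consecutive bt snapshots can be made mechanical. The sketch below is a rough heuristic only: it assumes each frame line has the shape `#N  name` as in the backtrace above, and `frame_names`, `stack_trend`, and `StackTrend` are hypothetical helpers, not part of gdb or EmmyLuaLS.

```rust
/// Extracts function names from gdb `bt` lines shaped like `#N  name`.
/// Lines that do not start with `#<number>` are ignored.
fn frame_names(bt: &str) -> Vec<String> {
    bt.lines()
        .filter_map(|line| {
            let rest = line.trim().strip_prefix('#')?;
            let (index, name) = rest.split_once(char::is_whitespace)?;
            index.parse::<u32>().ok()?; // keep only real frame lines
            Some(name.trim().to_string())
        })
        .collect()
}

#[derive(Debug, PartialEq)]
enum StackTrend {
    /// Depth strictly increases between snapshots: recursion may be unbounded.
    Growing,
    /// Depth is identical in all snapshots: work is repeating at a fixed depth.
    Stable,
    /// Fewer than two snapshots, or depth moves both ways.
    Inconclusive,
}

/// Compares stack depth across snapshots taken in chronological order.
fn stack_trend(snapshots: &[&str]) -> StackTrend {
    let depths: Vec<usize> = snapshots.iter().map(|bt| frame_names(bt).len()).collect();
    if depths.len() < 2 {
        StackTrend::Inconclusive
    } else if depths.windows(2).all(|pair| pair[1] > pair[0]) {
        StackTrend::Growing
    } else if depths.windows(2).all(|pair| pair[1] == pair[0]) {
        StackTrend::Stable
    } else {
        StackTrend::Inconclusive
    }
}
```

Neither outcome proves an infinite loop on its own. A stable depth is also what a long but finite computation looks like, so the result only narrows down where to look.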

@lewis6991

lewis6991 commented Jun 20, 2026

Collaborator Author

Are you able to provide a test case that doesn't work with #1115

This please.

@CppCXY

CppCXY commented Jun 22, 2026

Member

Observation: After startup, the workspace analysis enters the "flow analyze" stage. The single-threaded CPU usage remains at 100% for over 5 minutes without decreasing. This is a pure CPU calculation (without file I/O).

GDB backtrace (LWP 3795686, the thread consuming CPU resources):

  #0  mi_theap_malloc_zero_aligned_at_overalloc
  #1  hashbrown::raw::RawTableInner::fallible_with_capacity
  #2  hashbrown::raw::RawTable<T,A>::reserve_rehash
  #3  hashbrown::map::HashMap<K,V,S,A>::insert
  #4  emmylua_code_analysis::semantic::infer::infer_expr
  #5  emmylua_code_analysis::semantic::cache::LuaInferCache::with_replay_overlay
  #6  emmylua_code_analysis::semantic::infer::narrow::get_type_at_flow::FlowReplayQuery::replay_type
  #7  emmylua_code_analysis::semantic::infer::narrow::get_type_at_flow::FlowTypeEngine::start_expr_replay
  #8  emmylua_code_analysis::semantic::infer::narrow::infer_expr_narrow_type
  #9  emmylua_code_analysis::semantic::infer::infer_name::infer_var_ref_type
  #10 emmylua_code_analysis::semantic::infer::infer_name::infer_name_expr
  #11 emmylua_code_analysis::semantic::infer::infer_expr
  #12 emmylua_code_analysis::semantic::infer::infer_index::try_infer_expr_for_index
  #13 emmylua_code_analysis::semantic::infer::infer_index::infer_index_expr
  #14 emmylua_code_analysis::semantic::infer::infer_expr
  #15 emmylua_code_analysis::semantic::infer::infer_index::try_infer_expr_for_index
  #16 emmylua_code_analysis::semantic::infer::infer_index::infer_index_expr
  #17 emmylua_code_analysis::semantic::infer::infer_expr
  #18 emmylua_code_analysis::semantic::infer::infer_expr
  #19 emmylua_code_analysis::semantic::infer::infer_expr_list_types
  #20 emmylua_code_analysis::semantic::overload_resolve::resolve_signature
  #21 emmylua_code_analysis::semantic::infer::infer_call::infer_call_expr_func
  #22 emmylua_code_analysis::semantic::infer::infer_expr
  #23 emmylua_code_analysis::compilation::analyzer::lua::stats::analyze_local_stat
  #24 <LuaAnalysisPipeline as AnalysisPipeline>::analyze
  #25 emmylua_code_analysis::compilation::analyzer::analyze
  #26 emmylua_code_analysis::compilation::LuaCompilation::update_index
  #27 emmylua_code_analysis::EmmyLuaAnalysis::update_files_by_uri
  #28 emmylua_code_analysis::EmmyLuaAnalysis::reload_workspace_files
  #29 emmylua_ls::handlers::initialized::initialized_handler::{{closure}}

We obviously need a reproducible example that can be run independently. In practice, this kind of freeze is usually related to the shape of the code in a single file. Of course, much of the internal code cannot be made public. If you are willing to continue helping to investigate the issue, you can follow the general approach I have suggested to others for testing: copy the project's code out, cut half of it away, open the editor to test whether it loads properly. If it does not, cut away half again. If it does load properly, then the problem may lie in the other half that was cut away. Repeat this process until you find the smallest set of one or a few Lua files that can reliably reproduce the issue. If the relevant code is not convenient to disclose, you can obfuscate it manually, remove sensitive information, keep the issue reproducible, and then package and submit those files.
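The halving procedure described above amounts to a bisection over the file list. A minimal sketch, assuming a `hangs` callback that reports whether a given subset of files still reproduces the freeze; in practice that check is the manual "open the editor and see whether it loads" step, and `bisect_files` is a hypothetical helper.

```rust
/// Repeatedly halves `files`, keeping whichever half still reproduces the hang.
/// Stops at a single file, or earlier if neither half reproduces on its own
/// (which suggests the hang needs files from both halves).
fn bisect_files<F>(files: &[String], mut hangs: F) -> Vec<String>
where
    F: FnMut(&[String]) -> bool,
{
    let mut current = files.to_vec();
    while current.len() > 1 {
        let (first, second) = current.split_at(current.len() / 2);
        let next = if hangs(first) {
            Some(first.to_vec())
        } else if hangs(second) {
            Some(second.to_vec())
        } else {
            None
        };
        match next {
            Some(half) => current = half,
            None => break, // cannot shrink further by halving
        }
    }
    current
}
```

With roughly 9000 files, as in the log above, this takes on the order of a dozen rounds when a single file is responsible. When the loop stops early with more than one file left, the remaining set is the candidate to obfuscate and submit.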

@tzssangglass

OK, I will find some time to reproduce it.

@lewis6991

Collaborator Author

Observation: After startup, the workspace analysis enters the "flow analyze" stage. The single-threaded CPU usage remains at 100% for over 5 minutes without decreasing. This is a pure CPU calculation (without file I/O).

GDB backtrace (LWP 3795686, the thread consuming CPU resources):

  #0  mi_theap_malloc_zero_aligned_at_overalloc
  #1  hashbrown::raw::RawTableInner::fallible_with_capacity
  #2  hashbrown::raw::RawTable<T,A>::reserve_rehash
  #3  hashbrown::map::HashMap<K,V,S,A>::insert
...

...

This should be fixed in #1115 now.


Development

Successfully merging this pull request may close these issues.

Stuck loading

3 participants