common: util: stop using binary mode in SetContents #2371

IEncinas10 · 2025-03-27T18:39:27Z

Issue #2370 points out that inplace formatting in Windows converts \r\n into \n.

This happens because opening files in binary mode explicitly disables the otherwise implicit translation of \n into \r\n for Windows.

linux-foundation-easycla · 2025-03-27T18:39:34Z

The committers listed above are authorized under a signed CLA.

✅ login: IEncinas10 / name: Ignacio Encinas Rubio (0b21c64, f04149c, feaaa5a, b020a49)

IEncinas10 · 2025-03-27T18:41:46Z

I'm not really sure this fixes the issue but seems reasonable (?) Let's see if the CI produces binaries so that we can test it on Windows. Open for suggestions if there is a better way :)

hzeller · 2025-03-27T19:04:11Z

mmh, I rather would like file operations behave correctly (output exactly what we want) and rather do choices of what to output in content we generate.

IEncinas10 · 2025-03-27T19:24:08Z

I guess another potential fix would be changing FormatWhitespaceWithDisabledByteRanges to know whether it is being run in Windows and output "\r\n" instead of "\n"?

Edit: Unless I'm missing something, the "non inplace" path directly outputs to std::cout which pressumably does the \n --> \r\n translation.

hzeller · 2025-03-27T20:07:15Z

You try to fix the case of having CRLF output on Windows, but that is also not necessarily a good default (most developers who have to work on Windows prefer to use LF these days). So I am thinking of it more in terms of configurability.

Starting with a flag, probably somewhere around here
https://github.com/chipsalliance/verible/blob/master/verible/common/formatting/basic-format-style.h
(or in https://github.com/chipsalliance/verible/blob/master/verible/verilog/formatting/format-style.h but I think this is generic enough to get into the generic options) that allows to choose an output option to be one of LF, CRLF, AUTO (luckily the MacOS style \r-only is out of stlye) and then find all the places where they should apply.

Then it is possible to fix whatever place outputs newlines and also test on any platform that everything works as it should if forcing the one or other output in unit tests and observe that this is indeed working.
Then the 'auto' thing is easy to implement as that can configure downstream to whatever it has seen in the input.

hzeller · 2025-03-27T20:10:01Z

Edit: Unless I'm missing something, the "non inplace" path directly outputs to std::cout which pressumably does the \n --> \r\n translation.

We probably want to fix that as well to one or the other output once we have the flag.

IEncinas10 · 2025-04-01T17:12:06Z

You try to fix the case of having CRLF output on Windows, but that is also not necessarily a good default (most developers who have to work on Windows prefer to use LF these days). So I am thinking of it more in terms of configurability.

Makes sense. I had no idea people worked around this CRLF issue in Windows, thats why I assumed the "fix" would be to stop using binary mode.

Starting with a flag, probably somewhere around here https://github.com/chipsalliance/verible/blob/master/verible/common/formatting/basic-format-style.h (or in https://github.com/chipsalliance/verible/blob/master/verible/verilog/formatting/format-style.h but I think this is generic enough to get into the generic options) that allows to choose an output option to be one of LF, CRLF, AUTO (luckily the MacOS style \r-only is out of stlye) and then find all the places where they should apply.

Then it is possible to fix whatever place outputs newlines and also test on any platform that everything works as it should if forcing the one or other output in unit tests and observe that this is indeed working. Then the 'auto' thing is easy to implement as that can configure downstream to whatever it has seen in the input.

Chasing around every \n doesn't seem all that fun, what do you think of using a flag to either:

Follow the platform where verible is being run. In other words, stop using binary mode and rely on (libc?) to translate \n into \r\n or whatever
Just use \n (what is currently happening for verible-verilog-format --inline)

This should be easier to implement

hzeller · 2025-04-01T17:38:17Z

I suspect there is only really one or two places where newlines are emitted.

There are a few reasons why I want the internal representation to be a faithful representation of all the bytes in the input file not line-ending modified.

One of the things that Verible does is to keep track of the position in the file by having substrings into that file; if a file with one-line-ending is converted into a content with a different line-ending in memory, then the positions we report are somewhat off to the actual positions in files ( I suspect this is was the issue with #1950 but I couldn't really test on Windows).

Once someone implements memory mapping on Windows ( https://github.com/chipsalliance/verible/blob/master/verible/common/util/file-util.cc#L231 ) we will also have the full binary representation in memory on read.

On parsing we alerady correctly identify newlines in any shape ( https://github.com/chipsalliance/verible/blob/master/verible/verilog/parser/verilog.lex#L130 is doing that). That is responsible for essentially counting lines and keep track of relative distances from the beginning of the file.

When emitting text in formatting, we need to make sure we emit newlines as desired. And that is where I suspect it really is one or two places where that is happening.
For hacking on it I'd probably do something to emit something ridiculous (such as "/*HELLO*/\n" ) at the one (or two?) place we suspect these to be emitted to visually see if we don't miss anything before we apply the custom line ending.

I suspect there might be a discrepancy if we copy multiple lines verbatim from the input, e.g. some multi-line comment. If the input is CRLF, it will stay CRLF even if we want something else; maybe we already handle this case correctly; this corner case however would be actually benign when on Windows with CRLF output requirement as it does not change things; only if the user actually wants the output to be CRLF -> LF.

A new issue just came in #2383 which looks like it is in the same ballpark of problems.

Thanks for working on this, Ignacio!

IEncinas10 · 2025-04-01T18:43:15Z

Thank you for the deep dive! I'll see if I can get away with rubber duck debugging + GitHub CI. If it gets too messy I'll try my luck with Bazel on Windows :)

IEncinas10 · 2025-04-07T19:53:29Z

Sorry for the delay, I had a complicated week 😆. I'm not entirely happy with the result but it might be good enough

hzeller · 2025-04-09T18:32:15Z

Sorry, also busy week, I hope to get to it to review till Friday.

The chipsalliance folks have added a CLA to sign, you probably have to go through that process so that things can be merged.

hzeller

Nice!

I think what we need now is a test that creates to input with \n and with \r\n line endings, and then we run the formatter with line terminator style CR and CRLF and observe that whatever the input, the output corresponds to the chosen form.
(and possibly that inplace and from stdout behaves the same)

Pseudocode

for input_lineend in '\n' '\r\n'` ; do
    create_input_with_line_endings($input_lineend) > input.v
    for output_lineend in CR CRLF ; do
        verible-verilog-format --line_terminator=$output_lineend input.v > out-stream.v
        verible-verilog-format --line_terminator=$output_lineend -i input.v
        verify_that_line_endings_are($output_lineend, input.v)

        # also streamed output and in-place edited input should be the same
        cmp input.v out-stream.v
        fail_if_$?_of_command_before_is_not_zero
   done
done

hzeller · 2025-04-14T19:03:14Z

verible/common/formatting/basic-format-style.cc

+    case LineTerminatorStyle::kCRLF:
+      return stream << "\r\n";
+  }
+  __builtin_unreachable();


Interesting, did you have to do that ? The compiler should be able to figure that out by itself.

Not at all. I just added it out of caution (a new enum case might be defined without updating the print blah blah) I can drop it :)

Modern compilers will automatically warn if there is a new enum that is not handled, so no need (In fact, I always avoid adding a default case for these enum-switch/case to automatically have the compiler fine them for me).

(I am more worried of __builtin_unreachable() being very specific to gcc and clang and might stump other compilers)

I'll drop it, no worries. I added it out of habit because where I work people absolutely ignore compiler warnings 😆

IEncinas10 · 2025-04-15T06:29:56Z

@cezchitos it would be great if you could test this :)

verible-verilog-format --line_terminator=CRLF file.sv and verible-verilog-format --line_terminator=CRLF --inplace file.sv

I just realized my change breaks comment formatting:

First mismatched token [0]: ("// end of line comment") (#714: "// some comment") vs. ("// end of line comment") (#714: "// some comment\r")

Hopefully this issue doesn't happen if you use the line_terminator that matches your platform's. I'll work on fixing this

Edit: This issue seems to go away with #2370 (comment) Great news :D

IEncinas10 · 2025-04-15T06:31:14Z

Nice!

I think what we need now is a test that creates to input with \n and with \r\n line endings, and then we run the formatter with line terminator style CR and CRLF and observe that whatever the input, the output corresponds to the chosen form. (and possibly that inplace and from stdout behaves the same)

Pseudocode

for input_lineend in '\n' '\r\n'` ; do
    create_input_with_line_endings($input_lineend) > input.v
    for output_lineend in CR CRLF ; do
        verible-verilog-format --line_terminator=$output_lineend input.v > out-stream.v
        verible-verilog-format --line_terminator=$output_lineend -i input.v
        verify_that_line_endings_are($output_lineend, input.v)

        # also streamed output and in-place edited input should be the same
        cmp input.v out-stream.v
        fail_if_$?_of_command_before_is_not_zero
   done
done

Thanks for reviewing. I'll work on those tests!

icamaster · 2025-04-15T17:18:48Z

@cezchitos it would be great if you could test this :)

verible-verilog-format --line_terminator=CRLF file.sv and verible-verilog-format --line_terminator=CRLF --inplace file.sv

I just realized my change breaks comment formatting:

First mismatched token [0]: ("// end of line comment") (#714: "// some comment") vs. ("// end of line comment") (#714: "// some comment\r")

Hopefully this issue doesn't happen if you use the line_terminator that matches your platform's. I'll work on fixing this

Edit: This issue seems to go away with #2370 (comment) Great news :D

I have just managed to test and build. As long as I include the change in #2370 (comment) it all works great, nice work! Without the additional lex change I get the following error: 'Formatted output is lexically different from the input.' Maybe best to include that as well?

IEncinas10 · 2025-04-15T17:25:12Z

@cezchitos it would be great if you could test this :)
verible-verilog-format --line_terminator=CRLF file.sv and verible-verilog-format --line_terminator=CRLF --inplace file.sv
I just realized my change breaks comment formatting:
First mismatched token [0]: ("// end of line comment") (#714: "// some comment") vs. ("// end of line comment") (#714: "// some comment\r")
Hopefully this issue doesn't happen if you use the line_terminator that matches your platform's. I'll work on fixing this
Edit: This issue seems to go away with #2370 (comment) Great news :D

I have just managed to test and build. As long as I include the change in #2370 (comment) it all works great, nice work! Without the additional lex change I get the following error: 'Formatted output is lexically different from the input.' Maybe best to include that as well?

Thank you very much! Yeah, it seems we need your change. We have two options:

You merge your change separately and I rebase on top of it
You give me a patch and I include it on my branch

cezchitos · 2025-04-16T07:50:57Z

@cezchitos it would be great if you could test this :)
verible-verilog-format --line_terminator=CRLF file.sv and verible-verilog-format --line_terminator=CRLF --inplace file.sv
I just realized my change breaks comment formatting:
First mismatched token [0]: ("// end of line comment") (#714: "// some comment") vs. ("// end of line comment") (#714: "// some comment\r")
Hopefully this issue doesn't happen if you use the line_terminator that matches your platform's. I'll work on fixing this
Edit: This issue seems to go away with #2370 (comment) Great news :D

I have just managed to test and build. As long as I include the change in #2370 (comment) it all works great, nice work! Without the additional lex change I get the following error: 'Formatted output is lexically different from the input.' Maybe best to include that as well?

Thank you very much! Yeah, it seems we need your change. We have two options:
1. You merge your change separately and I rebase on top of it

2. You give me a patch and I include it on my branch

I think 2. would be the best. Unfortunately I am not familiar at all with the lexer, I just did the modifications in #2370 (comment) more or less by trial and error. Is the '.lex' actually generated by something else or is that patch enough?

hzeller · 2025-04-17T02:43:11Z

The patch in #2370 (comment) looks fishy though. It looks like it makes a character class that contains \r, \n and also |. This might accidentally work, but it will break with a pipe character and will not properly count lines on windows.

If anything, I'd expected rounded parenthesis around (\r\n|\n|\r|\0) to show the alternatives.

If the patch with the square-bracket works, I'd say it is probably accidental and it should be properly tested and explained by the flex documentation. From how I read it it looks like it should not work.

hzeller · 2025-04-17T03:07:50Z

I tested: as expected with the square brackets, the lexer unit test will fail (and a bunch of other tests).

If I had a guess, it might be probably due to the overly simplistic TextStructureView::LinesInfo::Get() implementation, as it might keep the trailing \r when splitting \r\n at the \n mark.

hzeller · 2025-05-01T12:34:34Z

What is the current status here ?

cezchitos · 2025-05-01T16:33:28Z

I tested: as expected with the square brackets, the lexer unit test will fail (and a bunch of other tests).

If I had a guess, it might be probably due to the overly simplistic TextStructureView::LinesInfo::Get() implementation, as it might keep the trailing \r when splitting \r\n at the \n mark.

It seemed a bit too easy to be true. Ok, so let's just forget my lexer change proposal.

What is the current status here ?

Since my lexer 'hack' didn't work I think there still needs to be a change for this particular case (comment formatting problem) before this PR can be merged, otherwise it will be in an (arguably) worse state (for a Windows user point of view) until it is fixed. I am happy to continue testing if there are any changes.
@IEncinas10 Would you be able to look at the TextStructureView::LinesInfo::Get() as suggested? If not, I can try to have a look in the future.

IEncinas10 · 2025-05-01T20:35:06Z

Sorry about the delay, things got pretty busy for me unexpectedly...

I'll try to write those tests we discussed and investigate the issue over the weekend :)

Edit:

I'm not familiar with lex at all, but this hack seems to solve the issue:

diff --git i/verible/verilog/parser/verilog.lex w/verible/verilog/parser/verilog.lex
index f90216cc1955..cd35b7d22d47 100644
--- i/verible/verilog/parser/verilog.lex
+++ w/verible/verilog/parser/verilog.lex
@@ -294,7 +294,10 @@ PragmaEndProtected {Pragma}{Space}+protect{Space}+end_protected
     yymore();
   }
   {LineTerminator} {
-    yyless(yyleng-1);  /* return \n to input stream */
+    if(yytext[yyleng-2] == '\r')
+      yyless(yyleng-2);  /* return \n to input stream */
+    else
+      yyless(yyleng-1);  /* return \n to input stream */
     UpdateLocation();
     yy_pop_state();
     return TK_EOL_COMMENT;

The error I'm getting seems to happen only inside commented blocks (and only when going from \n newline into \r\n newline mode) and this mitigates it

Instead of explicitly checking against "-" rely on IsStdin for better readability.

… by verible Newlines `\n` are transformed into `\r\n` by default when running in Windows systems. This is NOT what we want. For more information see 13e0cd0 (If non-mmap: use stdio reading of files instead of std::stream, 2024-10-02) This was already fixed for inplace formatting because the output file is opened in binary mode, but the non-inplace path was left untouched. Set stdout to binary mode too to make the behaviour consistent.

hzeller

Nice, just a few smallish comments.

I am wondering if this will be working for the 'formatting' feature in the language server as well (around here https://github.com/chipsalliance/verible/blob/master/verible/verilog/tools/ls/verible-lsp-adapter.cc#L277 ).

I don't know if windows-based editors even expects some sort of CR or CRLF in formatted output in the language server or if they auto-convert anyway. But worth checking with the spec and if we emit the right thing there.

hzeller · 2025-05-14T16:32:10Z

verible/common/formatting/basic-format-style.cc

+  return kLineTerminatorStyleStringMap;
+}
+
+std::ostream &operator<<(std::ostream &stream, LineTerminatorStyle style) {


mmh, I don't know if I like the auto-conversion from the enum to a string.
When I saw the code in comment-controls.cc I was surprised about printing an enum and then found this operator<<. So it didn't pass the "least surprise" test :)

So this might be too clever and probably should just have regular function. Something like

EmitLineTerminator(LineTerminatorStyle style, std::ostream &stream);

Makes sense. I hadn't heard about the "least surprise" before :)

I still need to keep the operator<< to make the enum-flags.h "machinery" happy.

verible/verilog/formatting/comment-controls.h

hzeller · 2025-05-14T16:42:19Z

verible/verilog/parser/verilog.lex

@@ -294,7 +294,11 @@ PragmaEndProtected {Pragma}{Space}+protect{Space}+end_protected
    yymore();
  }
  {LineTerminator} {
-    yyless(yyleng-1);  /* return \n to input stream */
+    // HACK


Can you add an explanation that the next person looking at this can well understand what is going on here ?

We allow \r-terminated lines in LineTerminator, looks like it makes sense to see if this is handled correctly in the lexer test (e.g. with two lines, separated by \r\r). There is a chance that there could be a situation where it is not.

Sure! I'll add a brief comment here.

About the cornercase youre mentioning: I think this is run only if we're in <IN_EOL_COMMENT>, that's why I figured we were fine doing yyleng-2.

Also, if we encounter an \r we'll exit the <IN_EOL_COMMENT> state, so I don't think we can get to this snippet with \r\r.

verible/verilog/tools/formatter/format_line_terminator_test.sh

verible/verilog/formatting/comment-controls.cc

hzeller · 2025-05-14T16:55:27Z

verible/common/formatting/basic-format-style.h

+  kLF,
+  // Carriage return + Line Feed `\r\n` (DOS Style)
+  kCRLF,
+};


As a follow-up, we should consider here some sort of 'auto' setting that then maybe looks at some statistics of the input newline-types it has seen and choose the corresponding output type. Not here, follow-up for later is sufficient.

hzeller · 2025-05-14T17:01:38Z

verible/verilog/tools/formatter/format_line_terminator_test.sh

+
+for original_newline in LF CRLF; do
+  if [[ "$original_newline" == "CRLF" ]]; then
+    unix2dos ${MY_INPUT_FILE}


I wonder if the dependency on unix2dos and dos2unix might cause some trouble on various machines that have to install these first to run the test successfully (case in point: the places where you have to add these in the verible-ci.yml).

Maybe some simple awk '{printf("%s\n", $0);}' and awk '{printf("%s\r\n", $0);}' (or something with sed).

(we already use sed and awk in some other tests, so at least it would not widen the dependencies on the system).

I think I figured out how to skip the dependency. I also thought it was a bit unnecesary but I wanted to ensure my hack worked on the CI as I don't have access to Windows/Mac machines.

verible/verilog/formatting/comment-controls_test.cc

Precise control of file contents is required. This makes verible handle files in `binary` mode, effectively disabling the platform-specific hooks that for example translate \n into \r\n for DOS systems. Let's introduce the line_terminator flag into BasicFormatStyle so that users can decide which line terminator they want to use. Current options are \n (LF) or \r\n (CRLF). LF is the default configuration.

DOS newlines (\r\n) leak the \r character into TK_EOL_COMMENT token. If the formatter changes the newline mode from UNIX into DOS, the original and the formatted version will be lexically different, triggering an error inside the VerifyFormatting

hzeller · 2025-05-19T07:48:53Z

Thanks a lot Ignacio. Merged.
Thanks for being a regular contributor to Verible!

IEncinas10 · 2025-05-19T08:59:59Z

Thanks a lot Ignacio. Merged.
Thanks for being a regular contributor to Verible!

Thank you for having such a great codebase and your reviews!

Edit: I meant to edit the PR name/description as my original approach wasnt the final implementation, but I forgot. Hopefully nobody looks at the merge commit itself. Sorry about that!

cezchitos mentioned this pull request Apr 2, 2025

On Windows, CRLF line endings are not preserved when using '--inplace' #2370

Open

IEncinas10 force-pushed the fix-windows-newline-handling branch from c22750b to d566ba4 Compare April 7, 2025 19:52

IEncinas10 force-pushed the fix-windows-newline-handling branch from d566ba4 to 2e4d875 Compare April 7, 2025 20:30

hzeller reviewed Apr 14, 2025

View reviewed changes

IEncinas10 added 2 commits May 1, 2025 22:42

verilog: tools: formatter: use IsStdin against input filename

0b21c64

Instead of explicitly checking against "-" rely on IsStdin for better readability.

IEncinas10 force-pushed the fix-windows-newline-handling branch 2 times, most recently from a3d993e to 57818be Compare May 2, 2025 21:36

IEncinas10 requested a review from hzeller May 13, 2025 18:55

hzeller reviewed May 14, 2025

View reviewed changes

IEncinas10 force-pushed the fix-windows-newline-handling branch 3 times, most recently from 300e454 to c636aa9 Compare May 15, 2025 20:06

hzeller reviewed May 16, 2025

View reviewed changes

verible/verilog/formatting/comment-controls_test.cc Outdated Show resolved Hide resolved

IEncinas10 force-pushed the fix-windows-newline-handling branch from c636aa9 to 0190db8 Compare May 16, 2025 16:31

IEncinas10 added 2 commits May 16, 2025 18:32

IEncinas10 force-pushed the fix-windows-newline-handling branch from 0190db8 to feaaa5a Compare May 16, 2025 16:32

hzeller merged commit d692407 into chipsalliance:master May 19, 2025
33 checks passed

hzeller mentioned this pull request May 19, 2025

text source file with CRLF do not properly format when using "//" comment #2383

Open

hzeller mentioned this pull request Jun 3, 2025

Formatted output is lexically different from the input (inside comment). #2424

Open

common: util: stop using binary mode in SetContents #2371

common: util: stop using binary mode in SetContents #2371

Uh oh!

Conversation

IEncinas10 commented Mar 27, 2025

Uh oh!

linux-foundation-easycla bot commented Mar 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

IEncinas10 commented Mar 27, 2025

Uh oh!

hzeller commented Mar 27, 2025

Uh oh!

IEncinas10 commented Mar 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hzeller commented Mar 27, 2025

Uh oh!

hzeller commented Mar 27, 2025

Uh oh!

IEncinas10 commented Apr 1, 2025

Uh oh!

hzeller commented Apr 1, 2025

Uh oh!

IEncinas10 commented Apr 1, 2025

Uh oh!

IEncinas10 commented Apr 7, 2025

Uh oh!

hzeller commented Apr 9, 2025

Uh oh!

hzeller left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

IEncinas10 commented Apr 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

IEncinas10 commented Apr 15, 2025

Uh oh!

icamaster commented Apr 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

IEncinas10 commented Apr 15, 2025

Uh oh!

cezchitos commented Apr 16, 2025

Uh oh!

hzeller commented Apr 17, 2025

Uh oh!

hzeller commented Apr 17, 2025

Uh oh!

hzeller commented May 1, 2025

Uh oh!

cezchitos commented May 1, 2025

Uh oh!

IEncinas10 commented May 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hzeller left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

IEncinas10 May 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

linux-foundation-easycla bot commented Mar 27, 2025 •

edited

Loading

IEncinas10 commented Mar 27, 2025 •

edited

Loading

IEncinas10 commented Apr 15, 2025 •

edited

Loading

icamaster commented Apr 15, 2025 •

edited

Loading

IEncinas10 commented May 1, 2025 •

edited

Loading

IEncinas10 May 14, 2025 •

edited

Loading

IEncinas10 commented May 19, 2025 •

edited

Loading