Skip to content

common: util: stop using binary mode in SetContents #2371

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
May 19, 2025

Conversation

IEncinas10
Copy link
Collaborator

Issue #2370 points out that inplace formatting in Windows converts \r\n into \n.

This happens because opening files in binary mode explicitly disables the otherwise implicit translation of \n into \r\n for Windows.

Copy link

linux-foundation-easycla bot commented Mar 27, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@IEncinas10
Copy link
Collaborator Author

I'm not really sure this fixes the issue but seems reasonable (?) Let's see if the CI produces binaries so that we can test it on Windows. Open for suggestions if there is a better way :)

@hzeller
Copy link
Collaborator

hzeller commented Mar 27, 2025

mmh, I rather would like file operations behave correctly (output exactly what we want) and rather do choices of what to output in content we generate.

@IEncinas10
Copy link
Collaborator Author

IEncinas10 commented Mar 27, 2025

I guess another potential fix would be changing FormatWhitespaceWithDisabledByteRanges to know whether it is being run in Windows and output "\r\n" instead of "\n"?

Edit: Unless I'm missing something, the "non inplace" path directly outputs to std::cout which pressumably does the \n --> \r\n translation.

@hzeller
Copy link
Collaborator

hzeller commented Mar 27, 2025

You try to fix the case of having CRLF output on Windows, but that is also not necessarily a good default (most developers who have to work on Windows prefer to use LF these days). So I am thinking of it more in terms of configurability.

Starting with a flag, probably somewhere around here
https://github.com/chipsalliance/verible/blob/master/verible/common/formatting/basic-format-style.h
(or in https://github.com/chipsalliance/verible/blob/master/verible/verilog/formatting/format-style.h but I think this is generic enough to get into the generic options) that allows to choose an output option to be one of LF, CRLF, AUTO (luckily the MacOS style \r-only is out of stlye) and then find all the places where they should apply.

Then it is possible to fix whatever place outputs newlines and also test on any platform that everything works as it should if forcing the one or other output in unit tests and observe that this is indeed working.
Then the 'auto' thing is easy to implement as that can configure downstream to whatever it has seen in the input.

@hzeller
Copy link
Collaborator

hzeller commented Mar 27, 2025

Edit: Unless I'm missing something, the "non inplace" path directly outputs to std::cout which pressumably does the \n --> \r\n translation.

We probably want to fix that as well to one or the other output once we have the flag.

@IEncinas10
Copy link
Collaborator Author

You try to fix the case of having CRLF output on Windows, but that is also not necessarily a good default (most developers who have to work on Windows prefer to use LF these days). So I am thinking of it more in terms of configurability.

Makes sense. I had no idea people worked around this CRLF issue in Windows, thats why I assumed the "fix" would be to stop using binary mode.

Starting with a flag, probably somewhere around here https://github.com/chipsalliance/verible/blob/master/verible/common/formatting/basic-format-style.h (or in https://github.com/chipsalliance/verible/blob/master/verible/verilog/formatting/format-style.h but I think this is generic enough to get into the generic options) that allows to choose an output option to be one of LF, CRLF, AUTO (luckily the MacOS style \r-only is out of stlye) and then find all the places where they should apply.

Then it is possible to fix whatever place outputs newlines and also test on any platform that everything works as it should if forcing the one or other output in unit tests and observe that this is indeed working. Then the 'auto' thing is easy to implement as that can configure downstream to whatever it has seen in the input.

Chasing around every \n doesn't seem all that fun, what do you think of using a flag to either:

  1. Follow the platform where verible is being run. In other words, stop using binary mode and rely on (libc?) to translate \n into \r\n or whatever
  2. Just use \n (what is currently happening for verible-verilog-format --inline)

This should be easier to implement

@hzeller
Copy link
Collaborator

hzeller commented Apr 1, 2025

I suspect there is only really one or two places where newlines are emitted.

There are a few reasons why I want the internal representation to be a faithful representation of all the bytes in the input file not line-ending modified.

One of the things that Verible does is to keep track of the position in the file by having substrings into that file; if a file with one-line-ending is converted into a content with a different line-ending in memory, then the positions we report are somewhat off to the actual positions in files ( I suspect this is was the issue with #1950 but I couldn't really test on Windows).

Once someone implements memory mapping on Windows ( https://github.com/chipsalliance/verible/blob/master/verible/common/util/file-util.cc#L231 ) we will also have the full binary representation in memory on read.

On parsing we alerady correctly identify newlines in any shape ( https://github.com/chipsalliance/verible/blob/master/verible/verilog/parser/verilog.lex#L130 is doing that). That is responsible for essentially counting lines and keep track of relative distances from the beginning of the file.

When emitting text in formatting, we need to make sure we emit newlines as desired. And that is where I suspect it really is one or two places where that is happening.
For hacking on it I'd probably do something to emit something ridiculous (such as "/*HELLO*/\n" ) at the one (or two?) place we suspect these to be emitted to visually see if we don't miss anything before we apply the custom line ending.

I suspect there might be a discrepancy if we copy multiple lines verbatim from the input, e.g. some multi-line comment. If the input is CRLF, it will stay CRLF even if we want something else; maybe we already handle this case correctly; this corner case however would be actually benign when on Windows with CRLF output requirement as it does not change things; only if the user actually wants the output to be CRLF -> LF.

A new issue just came in #2383 which looks like it is in the same ballpark of problems.

Thanks for working on this, Ignacio!

@IEncinas10
Copy link
Collaborator Author

Thank you for the deep dive! I'll see if I can get away with rubber duck debugging + GitHub CI. If it gets too messy I'll try my luck with Bazel on Windows :)

@IEncinas10
Copy link
Collaborator Author

Sorry for the delay, I had a complicated week 😆. I'm not entirely happy with the result but it might be good enough

@IEncinas10 IEncinas10 force-pushed the fix-windows-newline-handling branch from d566ba4 to 2e4d875 Compare April 7, 2025 20:30
@hzeller
Copy link
Collaborator

hzeller commented Apr 9, 2025

Sorry, also busy week, I hope to get to it to review till Friday.

The chipsalliance folks have added a CLA to sign, you probably have to go through that process so that things can be merged.

Copy link
Collaborator

@hzeller hzeller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

I think what we need now is a test that creates to input with \n and with \r\n line endings, and then we run the formatter with line terminator style CR and CRLF and observe that whatever the input, the output corresponds to the chosen form.
(and possibly that inplace and from stdout behaves the same)

Pseudocode

for input_lineend in '\n' '\r\n'` ; do
    create_input_with_line_endings($input_lineend) > input.v
    for output_lineend in CR CRLF ; do
        verible-verilog-format --line_terminator=$output_lineend input.v > out-stream.v
        verible-verilog-format --line_terminator=$output_lineend -i input.v
        verify_that_line_endings_are($output_lineend, input.v)

        # also streamed output and in-place edited input should be the same
        cmp input.v out-stream.v
        fail_if_$?_of_command_before_is_not_zero
   done
done

case LineTerminatorStyle::kCRLF:
return stream << "\r\n";
}
__builtin_unreachable();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, did you have to do that ? The compiler should be able to figure that out by itself.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not at all. I just added it out of caution (a new enum case might be defined without updating the print blah blah) I can drop it :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modern compilers will automatically warn if there is a new enum that is not handled, so no need (In fact, I always avoid adding a default case for these enum-switch/case to automatically have the compiler fine them for me).

(I am more worried of __builtin_unreachable() being very specific to gcc and clang and might stump other compilers)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll drop it, no worries. I added it out of habit because where I work people absolutely ignore compiler warnings 😆

@IEncinas10
Copy link
Collaborator Author

IEncinas10 commented Apr 15, 2025

@cezchitos it would be great if you could test this :)

verible-verilog-format --line_terminator=CRLF file.sv and verible-verilog-format --line_terminator=CRLF --inplace file.sv

I just realized my change breaks comment formatting:

First mismatched token [0]: ("// end of line comment") (#714: "// some comment") vs. ("// end of line comment") (#714: "// some comment\r")

Hopefully this issue doesn't happen if you use the line_terminator that matches your platform's. I'll work on fixing this

Edit: This issue seems to go away with #2370 (comment) Great news :D

@IEncinas10
Copy link
Collaborator Author

Nice!

I think what we need now is a test that creates to input with \n and with \r\n line endings, and then we run the formatter with line terminator style CR and CRLF and observe that whatever the input, the output corresponds to the chosen form. (and possibly that inplace and from stdout behaves the same)

Pseudocode

for input_lineend in '\n' '\r\n'` ; do
    create_input_with_line_endings($input_lineend) > input.v
    for output_lineend in CR CRLF ; do
        verible-verilog-format --line_terminator=$output_lineend input.v > out-stream.v
        verible-verilog-format --line_terminator=$output_lineend -i input.v
        verify_that_line_endings_are($output_lineend, input.v)

        # also streamed output and in-place edited input should be the same
        cmp input.v out-stream.v
        fail_if_$?_of_command_before_is_not_zero
   done
done

Thanks for reviewing. I'll work on those tests!

@icamaster
Copy link

icamaster commented Apr 15, 2025

@cezchitos it would be great if you could test this :)

verible-verilog-format --line_terminator=CRLF file.sv and verible-verilog-format --line_terminator=CRLF --inplace file.sv

I just realized my change breaks comment formatting:

First mismatched token [0]: ("// end of line comment") (#714: "// some comment") vs. ("// end of line comment") (#714: "// some comment\r")

Hopefully this issue doesn't happen if you use the line_terminator that matches your platform's. I'll work on fixing this

Edit: This issue seems to go away with #2370 (comment) Great news :D

I have just managed to test and build. As long as I include the change in #2370 (comment) it all works great, nice work! Without the additional lex change I get the following error: 'Formatted output is lexically different from the input.' Maybe best to include that as well?

@IEncinas10
Copy link
Collaborator Author

@cezchitos it would be great if you could test this :)
verible-verilog-format --line_terminator=CRLF file.sv and verible-verilog-format --line_terminator=CRLF --inplace file.sv
I just realized my change breaks comment formatting:
First mismatched token [0]: ("// end of line comment") (#714: "// some comment") vs. ("// end of line comment") (#714: "// some comment\r")
Hopefully this issue doesn't happen if you use the line_terminator that matches your platform's. I'll work on fixing this
Edit: This issue seems to go away with #2370 (comment) Great news :D

I have just managed to test and build. As long as I include the change in #2370 (comment) it all works great, nice work! Without the additional lex change I get the following error: 'Formatted output is lexically different from the input.' Maybe best to include that as well?

Thank you very much! Yeah, it seems we need your change. We have two options:

  1. You merge your change separately and I rebase on top of it
  2. You give me a patch and I include it on my branch

@cezchitos
Copy link

@cezchitos it would be great if you could test this :)
verible-verilog-format --line_terminator=CRLF file.sv and verible-verilog-format --line_terminator=CRLF --inplace file.sv
I just realized my change breaks comment formatting:
First mismatched token [0]: ("// end of line comment") (#714: "// some comment") vs. ("// end of line comment") (#714: "// some comment\r")
Hopefully this issue doesn't happen if you use the line_terminator that matches your platform's. I'll work on fixing this
Edit: This issue seems to go away with #2370 (comment) Great news :D

I have just managed to test and build. As long as I include the change in #2370 (comment) it all works great, nice work! Without the additional lex change I get the following error: 'Formatted output is lexically different from the input.' Maybe best to include that as well?

Thank you very much! Yeah, it seems we need your change. We have two options:

1. You merge your change separately and I rebase on top of it

2. You give me a patch and I include it on my branch

I think 2. would be the best. Unfortunately I am not familiar at all with the lexer, I just did the modifications in #2370 (comment) more or less by trial and error. Is the '.lex' actually generated by something else or is that patch enough?

@hzeller
Copy link
Collaborator

hzeller commented Apr 17, 2025

The patch in #2370 (comment) looks fishy though. It looks like it makes a character class that contains \r, \n and also |. This might accidentally work, but it will break with a pipe character and will not properly count lines on windows.

If anything, I'd expected rounded parenthesis around (\r\n|\n|\r|\0) to show the alternatives.

If the patch with the square-bracket works, I'd say it is probably accidental and it should be properly tested and explained by the flex documentation. From how I read it it looks like it should not work.

@hzeller
Copy link
Collaborator

hzeller commented Apr 17, 2025

I tested: as expected with the square brackets, the lexer unit test will fail (and a bunch of other tests).

If I had a guess, it might be probably due to the overly simplistic TextStructureView::LinesInfo::Get() implementation, as it might keep the trailing \r when splitting \r\n at the \n mark.

@hzeller
Copy link
Collaborator

hzeller commented May 1, 2025

What is the current status here ?

@cezchitos
Copy link

I tested: as expected with the square brackets, the lexer unit test will fail (and a bunch of other tests).

If I had a guess, it might be probably due to the overly simplistic TextStructureView::LinesInfo::Get() implementation, as it might keep the trailing \r when splitting \r\n at the \n mark.

It seemed a bit too easy to be true. Ok, so let's just forget my lexer change proposal.

What is the current status here ?

Since my lexer 'hack' didn't work I think there still needs to be a change for this particular case (comment formatting problem) before this PR can be merged, otherwise it will be in an (arguably) worse state (for a Windows user point of view) until it is fixed. I am happy to continue testing if there are any changes.
@IEncinas10 Would you be able to look at the TextStructureView::LinesInfo::Get() as suggested? If not, I can try to have a look in the future.

@IEncinas10
Copy link
Collaborator Author

IEncinas10 commented May 1, 2025

Sorry about the delay, things got pretty busy for me unexpectedly...

I'll try to write those tests we discussed and investigate the issue over the weekend :)

Edit:

I'm not familiar with lex at all, but this hack seems to solve the issue:

diff --git i/verible/verilog/parser/verilog.lex w/verible/verilog/parser/verilog.lex
index f90216cc1955..cd35b7d22d47 100644
--- i/verible/verilog/parser/verilog.lex
+++ w/verible/verilog/parser/verilog.lex
@@ -294,7 +294,10 @@ PragmaEndProtected {Pragma}{Space}+protect{Space}+end_protected
     yymore();
   }
   {LineTerminator} {
-    yyless(yyleng-1);  /* return \n to input stream */
+    if(yytext[yyleng-2] == '\r')
+      yyless(yyleng-2);  /* return \n to input stream */
+    else
+      yyless(yyleng-1);  /* return \n to input stream */
     UpdateLocation();
     yy_pop_state();
     return TK_EOL_COMMENT;

The error I'm getting seems to happen only inside commented blocks (and only when going from \n newline into \r\n newline mode) and this mitigates it

IEncinas10 added 2 commits May 1, 2025 22:42
Instead of explicitly checking against "-" rely on IsStdin for better
readability.
… by verible

Newlines `\n` are transformed into `\r\n` by default when running in
Windows systems. This is NOT what we want. For more information see

  13e0cd0 (If non-mmap: use stdio reading of files instead of std::stream, 2024-10-02)

This was already fixed for inplace formatting because the output file is
opened in binary mode, but the non-inplace path was left untouched. Set
stdout to binary mode too to make the behaviour consistent.
@IEncinas10 IEncinas10 force-pushed the fix-windows-newline-handling branch 2 times, most recently from a3d993e to 57818be Compare May 2, 2025 21:36
@IEncinas10 IEncinas10 requested a review from hzeller May 13, 2025 18:55
Copy link
Collaborator

@hzeller hzeller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, just a few smallish comments.

I am wondering if this will be working for the 'formatting' feature in the language server as well (around here https://github.com/chipsalliance/verible/blob/master/verible/verilog/tools/ls/verible-lsp-adapter.cc#L277 ).

I don't know if windows-based editors even expects some sort of CR or CRLF in formatted output in the language server or if they auto-convert anyway. But worth checking with the spec and if we emit the right thing there.

return kLineTerminatorStyleStringMap;
}

std::ostream &operator<<(std::ostream &stream, LineTerminatorStyle style) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmh, I don't know if I like the auto-conversion from the enum to a string.
When I saw the code in comment-controls.cc I was surprised about printing an enum and then found this operator<<. So it didn't pass the "least surprise" test :)

So this might be too clever and probably should just have regular function. Something like

EmitLineTerminator(LineTerminatorStyle style, std::ostream &stream);

Copy link
Collaborator Author

@IEncinas10 IEncinas10 May 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I hadn't heard about the "least surprise" before :)

I still need to keep the operator<< to make the enum-flags.h "machinery" happy.

@@ -294,7 +294,11 @@ PragmaEndProtected {Pragma}{Space}+protect{Space}+end_protected
yymore();
}
{LineTerminator} {
yyless(yyleng-1); /* return \n to input stream */
// HACK
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add an explanation that the next person looking at this can well understand what is going on here ?

We allow \r-terminated lines in LineTerminator, looks like it makes sense to see if this is handled correctly in the lexer test (e.g. with two lines, separated by \r\r). There is a chance that there could be a situation where it is not.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! I'll add a brief comment here.

About the cornercase youre mentioning: I think this is run only if we're in <IN_EOL_COMMENT>, that's why I figured we were fine doing yyleng-2.

Also, if we encounter an \r we'll exit the <IN_EOL_COMMENT> state, so I don't think we can get to this snippet with \r\r.

kLF,
// Carriage return + Line Feed `\r\n` (DOS Style)
kCRLF,
};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a follow-up, we should consider here some sort of 'auto' setting that then maybe looks at some statistics of the input newline-types it has seen and choose the corresponding output type. Not here, follow-up for later is sufficient.


for original_newline in LF CRLF; do
if [[ "$original_newline" == "CRLF" ]]; then
unix2dos ${MY_INPUT_FILE}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if the dependency on unix2dos and dos2unix might cause some trouble on various machines that have to install these first to run the test successfully (case in point: the places where you have to add these in the verible-ci.yml).

Maybe some simple awk '{printf("%s\n", $0);}' and awk '{printf("%s\r\n", $0);}' (or something with sed).

(we already use sed and awk in some other tests, so at least it would not widen the dependencies on the system).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I figured out how to skip the dependency. I also thought it was a bit unnecesary but I wanted to ensure my hack worked on the CI as I don't have access to Windows/Mac machines.

@IEncinas10 IEncinas10 force-pushed the fix-windows-newline-handling branch 3 times, most recently from 300e454 to c636aa9 Compare May 15, 2025 20:06
@IEncinas10 IEncinas10 force-pushed the fix-windows-newline-handling branch from c636aa9 to 0190db8 Compare May 16, 2025 16:31
Precise control of file contents is required. This makes verible handle
files in `binary` mode, effectively disabling the platform-specific
hooks that for example translate \n into \r\n for DOS systems.

Let's introduce the line_terminator flag into BasicFormatStyle so that
users can decide which line terminator they want to use. Current options
are \n (LF) or \r\n (CRLF). LF is the default configuration.
DOS newlines (\r\n) leak the \r character into TK_EOL_COMMENT token. If
the formatter changes the newline mode from UNIX into DOS, the original
and the formatted version will be lexically different, triggering an
error inside the VerifyFormatting
@IEncinas10 IEncinas10 force-pushed the fix-windows-newline-handling branch from 0190db8 to feaaa5a Compare May 16, 2025 16:32
@hzeller hzeller merged commit d692407 into chipsalliance:master May 19, 2025
33 checks passed
@hzeller
Copy link
Collaborator

hzeller commented May 19, 2025

Thanks a lot Ignacio. Merged.
Thanks for being a regular contributor to Verible!

@IEncinas10
Copy link
Collaborator Author

IEncinas10 commented May 19, 2025

Thanks a lot Ignacio. Merged.
Thanks for being a regular contributor to Verible!

Thank you for having such a great codebase and your reviews!

Edit: I meant to edit the PR name/description as my original approach wasnt the final implementation, but I forgot. Hopefully nobody looks at the merge commit itself. Sorry about that!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants