Simple webchat for server by tobi · Pull Request #1998 · ggml-org/llama.cpp


Conversation

@tobi

I put together a simple web chat that demonstrates how to use the SSE(-ish) streaming in the server example. I also went ahead and served it from the root URL, to make the server a bit more approachable.

I tried to match the spirit of llama.cpp and used minimalistic js dependencies and went with the ozempic css style of ggml.ai.

Initially I went for no JS dependencies, but gave up and used a few minimal ones that I'm importing from JS CDNs instead of adding them here. Let me know if you agree with this approach. I needed Microsoft's fetch-event-source for using EventSource over POST (super disappointed that browsers don't support that, actually) and preact+htm for keeping my sanity with all this state. The upshot is that everything is in one small html file. Speaking of which, there is probably a better (and less fragile) way to include the server.html in the C++ binary, but it's been 25 years since I worked with C++ tooling.

(updated screenshot)


@tobi changed the title from "Server simple web" to "Simple webchat for server"

Jun 26, 2023

ggerganov


Love this!

Initially I went for no JS dependencies, but gave up and used a few minimal ones that I'm importing from JS CDNs instead of adding them here. Let me know if you agree with this approach.

I think it's good

Speaking of which, there is probably a better (and less fragile) way to include the server.html in the C++ binary, but it's been 25 years since I worked with C++ tooling.

I guess it would be useful to specify the HTTP root path from the command line arguments instead of hard coding the path. But we can fix this later.

Approving and letting the "server team" take a look and merge

@slaren

I think this is a good idea, but the html file should be in the binary. This will not work with the automatic builds because they don't include the contents of the examples directory.

@IgnacioFDM

IMHO having the JS dependencies locally would be better, so it works without an internet connection and avoids the risk of malicious JS.


@SlyEcho

I have done something like this before, I was serving HTML files in a C HTTP server. There was a CMake option to either build them in or to read them from disk. Reading from files is useful for development because you don't need to rebuild and restart the server. But building them in requires creating a small program that can hexdump the file into a C array definition. Overall pretty complex and then we have the Makefile as well...

using event-source over POST (super disappointed that browsers don't support that, actually)

Maybe we could add an endpoint with GET and query parameters?
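
One way that could look, sketched against cpp-httplib (which the server example already uses); next_chunk() is a hypothetical stand-in for the actual generation loop, not a real function in server.cpp:

// GET /completion?prompt=... so a plain EventSource can be used instead of POST
svr.Get("/completion", [](const httplib::Request & req, httplib::Response & res) {
    const std::string prompt = req.get_param_value("prompt");
    res.set_chunked_content_provider("text/event-stream",
        [prompt](size_t /*offset*/, httplib::DataSink & sink) {
            const std::string chunk = next_chunk(prompt); // hypothetical helper
            if (chunk.empty()) {                          // generation finished
                sink.done();
                return true;
            }
            const std::string ev = "data: " + chunk + "\n\n"; // SSE framing
            sink.write(ev.data(), ev.size());
            return true; // the provider is called again for the next chunk
        });
});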

@Green-Sky

requires creating a small program that can hexdump the file into a C array definition.

Should be pretty simple. I don't touch Makefiles directly often; how bad are custom targets?

Or we ship the generated file.

(more reading here https://thephd.dev/finally-embed-in-c23)

@SlyEcho

Should be pretty simple. I don't touch Makefiles directly often; how bad are custom targets?

I'd say Makefiles are a lot easier for this than CMake but it's just added complexity.

@tobi, How hard would it be for you to jam the HTML file contents into the .cpp file?

@Green-Sky

After some thinking, I realized that we have pure text file(s), which means we only need to prefix and suffix the file with raw string literal markers,
e.g.:

echo "R\"htmlraw(" > html_build.cpp
cat index.html >> html_build.cpp
echo ")htmlraw\"" >> html_build.cpp

in server.cpp:

const char* html_str =
#include "html_build.cpp"
;

edit: resulting html_build.cpp:

R"htmlraw(<html></html>)htmlraw"


@Green-Sky

Ok, gave it a go (running it) and found an issue: whenever the window loses focus (switching windows), it restarts the currently outputting prompt. See screencap.
(I switched a couple of times back and forth)

@tobi

@tobi, How hard would it be for you to jam the HTML file contents into the .cpp file?

Simple enough. I was hoping to hear that there is some kind of #embed thing that works in all the C++ compilers that we care about. Crazy that it took until C23 to get that into the standard.

I can just include it. I can also just embed one dependency js and call it a day.

The best argument for keeping it in the html file is to allow people to hack on it more easily. I think this could become a really good chatbot UX if we are welcoming to contributors. It's got good bones 😄
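
For reference, this is roughly what the C23 version looks like (illustrative only; as pointed out below, it is far too new to rely on here):

// C23's #embed expands, at preprocessing time, to a comma-separated list of byte values
static const unsigned char index_html[] = {
    #embed "index.html"
};
static const size_t index_html_len = sizeof(index_html);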

@SlyEcho

#embed is not gonna work because it's too new.

Yes, it will be harder to develop, but you can also run a simple web server (e.g. with Python) while developing it.

We can improve it later.

@howard0su

@SlyEcho

cmake script

Cool but we also have to support pure Makefile.

@Green-Sky

I feel ignored 😅
We are not dealing with binary files here, so my #1998 (comment) solution is simply 3 text file concats. Pretty sure it won't get much simpler :)

@SlyEcho

3 text file concats

Does it work in Windows?

@Green-Sky

3 text file concats

Does it work in Windows?

If you use make on Windows, you likely also have some coreutils installed (echo and cat).

CMake has built-in functions to read/write/append files :)

@ggerganov

For me, the greatest value of this example is that it demonstrates a minimalistic way to implement a basic HTML/JS client that communicates with the server using just a browser, without having to install node or curl. How the client is served can be solved in many different ways, depending on the needs of the specific project. I recommend merging the example as it is; improvements can potentially be added later on master.

@Green-Sky

For me, the greatest value of this example is that it demonstrates a minimalistic way to implement a basic HTML/JS client that communicates with the server using just a browser, without having to install node or curl. How the client is served can be solved in many different ways, depending on the needs of the specific project. I recommend merging the example as it is; improvements can potentially be added later on master.

Agree, except we should really not hard-code the path to the HTML. We basically ship the server, and that would look funky.

@tobi would it be too much to ask to implement an HTML root CLI parameter for the server executable?
Or, for the fast track: if the hardcoded .html file could not be loaded (!file.is_open()), fall back to the previous HTML string?
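
A minimal sketch of that fallback, assuming the page is also available as an embedded string (index_html_baked here is a stand-in for the raw-string-literal trick above, and the path is just a placeholder); the body below would sit inside the existing GET handler:

#include <fstream>
#include <iterator>
#include <string>

// try the on-disk copy first, fall back to the embedded one
std::ifstream file("examples/server/index.html"); // placeholder path
std::string page;
if (file.is_open()) {
    page.assign(std::istreambuf_iterator<char>(file), std::istreambuf_iterator<char>());
} else {
    page = index_html_baked; // the previous, embedded HTML string
}
res.set_content(page, "text/html");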

@tobi

Sure, I'll try to do that tonight.

@tobi

OK, so I did basically all of those things. There is now a --path param that you can point to any directory, and static files will be served from it. I also added a deps.sh which just bakes index.html and index.js into .hpp files (as per @Green-Sky's suggestion). So really, you can launch ./server from the llama.cpp folder and it will use the ./examples/server/public directory, copy the ./server file to /tmp and it will just use the baked-in ones, or use --path to work on your own UX.

The only downside is that we duplicate some files in git here, because of the baked .cpp files. But the deps are so small that it probably doesn't matter. It would be slightly cleaner to make deps.sh a build step in CMake and the Makefile, but... well... I ran out of courage.
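
Roughly, the --path behaviour maps onto cpp-httplib's mount points as in the sketch below (the actual argument handling in server.cpp may differ, and index_html_baked stands in for the baked .hpp contents):

// serve static files from --path (default: ./examples/server/public)
std::string public_path = "examples/server/public";
// ... public_path gets overridden here if --path was passed on the command line ...
if (!svr.set_mount_point("/", public_path)) {
    // directory not found (e.g. the binary was copied to /tmp): fall back to the baked-in files
    svr.Get("/", [](const httplib::Request &, httplib::Response & res) {
        res.set_content(index_html_baked, "text/html");
    });
}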

@tobi

@ggerganov server is in reasonably good shape overall. Maybe time to include it in the default build?

@IgnacioFDM

I like the current approach, with the website embedded in the binary for simplicity, but also the option to serve from a directory, to improve iteration time and to allow user customization without recompiling. It also includes the js dependencies locally.

I agree with merging this in its current state. Further improvements can be done in future PRs.

🚢

SlyEcho


Actually, could you move the generated files next to the source files into the public folder? If we get more files, keeping to the same directory structure will be neater.

@ggerganov

server is in reasonably good shape overall. Maybe time to include it in the default build?

Yes, let's do that. Originally, I insisted on putting it behind an option since it was bringing in the Boost library as a dependency, which is a very big burden. Now that the implementation is so self-contained and minimal, we should enable the build by default and maintain it long term.


@Green-Sky

If you pull from master, the CI issues should go away.

@tobi

Actually, could you move the generated files next to the source files into the public folder? If we get more files, keeping to the same directory structure will be neater.

That would make it possible to request the files at /index.html.cpp - still want that?

@rain-1

Is it chat only or is there also a text completion UI?

@SlyEcho

@jarombouts

I tried to match the spirit of llama.cpp and used minimalistic js dependencies and went with the ozempic css style of ggml.ai.

I have nothing of use to add to this discussion but just want to point out that your use of ozempic as an adjective is blowing my mind


@tobi mentioned this pull request

Jul 5, 2023

@YannickFricke

@Green-Sky

Warning: When not used over HTTP/2, SSE suffers from a limitation to the maximum number of open connections, which can be specially painful when opening various tabs as the limit is per browser and set to a very low number (6). The issue has been marked as "Won't fix" in Chrome and Firefox.

@countzero

@tobi I love it! Very functional and a great addition!

I compiled it successfully on Windows (https://github.com/countzero/windows_llama.cpp) and everything works except one CLI setting: the server example does not support the --n-predict option. Is this an oversight/bug, or intended?

I expected the model options of server to be symmetrical to main.

@SlyEcho

n_predict is an API request parameter. The web chat currently has a hardcoded number.
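
Server-side this is read from the JSON body of the /completion request, roughly like the sketch below (assuming nlohmann::json for the parsing; the default of 128 is only illustrative). The web chat just needs to put n_predict into the JSON it POSTs instead of the hardcoded value:

// inside the /completion handler: parse the POST body and pick up n_predict
const json body = json::parse(req.body);
const int n_predict = body.value("n_predict", 128); // fall back to a default if absent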

@shametim

FWIW, on Windows 10 running on MSYS2 UCRT64, I had to add LDFLAGS += -lws2_32 (which links the Windows Sockets API) in llama.cpp/Makefile to resolve build issues like this:

server.cpp:(.text$_ZN7httplib6Server24process_and_close_socketEy[_ZN7httplib6Server24process_and_close_socketEy]+0x10d): undefined reference to `__imp_closesocket'

@YannickFricke

Warning: When not used over HTTP/2, SSE suffers from a limitation to the maximum number of open connections, which can be specially painful when opening various tabs as the limit is per browser and set to a very low number (6). The issue has been marked as "Won't fix" in Chrome and Firefox.

@Green-Sky

Yeah, SSE isn't that good compared to WebSockets.

So you suggest switching over to WebSockets? But it will be a pain to implement them, as the current httplib doesn't really support them (you would have to do everything on your own).

And what if a user has more than 6 concurrently open connections to llama?

Another approach could be that the EventSource is only open while we stream the completions, so you could completely circumvent this issue :)

@rain-1

There is a crash if the user continues to enter text and hits return to send it while the model is streaming tokens back.

I think it is due to the lock being released in the /completion handler. It may need to be passed inside the thread that is spun off: const auto chunked_content_provider = [&](size_t, DataSink & sink) {

actually

            const auto chunked_content_provider = [&](size_t, DataSink & sink) {
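                // take the lock inside the content provider so it is held for the whole stream,
                // and a second /completion request waits instead of racing the first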
                auto my_lock = llama.lock();

seems to fix this.

@Green-Sky

I don't like this. Can we please include this as text, not octets? It is obfuscated.

An option is to generate the .hpp file from the JS plain text during compilation.

You can find more context further up (in this thread).

@tobi

We should definitely add the xxd runs to the Makefile instead and remove them from the repo soon, because the octet output will balloon the git history otherwise. I do think it's very important to bake the files into the binary, though. Honestly, it's crazy that it took C and C++ so long to add standard ways of doing this.

We did this in ASM in the 90s all the time.

@rain-1

We should definitely add the xxd runs to the Makefile instead and remove them from the repo soon, because the octet output will balloon the git history otherwise. I do think it's very important to bake the files into the binary, though. Honestly, it's crazy that it took C and C++ so long to add standard ways of doing this.

We did this in ASM in the 90s all the time.

Another option may be to use ld -b binary, whatever you prefer :)

@SlyEcho

xxd is not a standard tool; it is part of the vim editor. The issue is portability. It would be better to have our own hex converter in something like Python or C.
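
Such a converter is small; here is a sketch in plain C-style C++ (bin2c is a hypothetical name, and this is not what the PR currently ships, which uses xxd via deps.sh):

#include <cstdio>
#include <fstream>
#include <iterator>
#include <vector>

// usage: bin2c <input file> <symbol name> > output.hpp
int main(int argc, char ** argv) {
    if (argc < 3) {
        std::fprintf(stderr, "usage: %s <input> <symbol>\n", argv[0]);
        return 1;
    }
    std::ifstream in(argv[1], std::ios::binary);
    if (!in.is_open()) {
        std::fprintf(stderr, "cannot open %s\n", argv[1]);
        return 1;
    }
    const std::vector<unsigned char> data(
        (std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>());
    // emit a C array definition plus a length variable
    std::printf("unsigned char %s[] = {", argv[2]);
    for (size_t i = 0; i < data.size(); i++) {
        std::printf("%s0x%02x,", i % 12 == 0 ? "\n    " : " ", data[i]);
    }
    std::printf("\n};\nunsigned int %s_len = %u;\n", argv[2], (unsigned) data.size());
    return 0;
}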

@Green-Sky

arch-btw


<div>
<label for="nPredict">Predictions</label>
<input type="range" id="nPredict" min="1" max="2048" step="1" name="n_predict" value="${params.value.n_predict}" oninput=${updateParamsFloat} />


Should we make min="-1" an option for infinity? Thanks!

@arch-btw