Fabien Benetou's PIM | Content / SelfHostingArtificialIntelligence

Motivation

Artificial Intelligence (AI) is often presented as a complex field: the state of the art impossible to understand, models too large to train, incredible work in progress that could change everything, yet a black box inscrutable to anyone except the select few.

This is truly damaging to the field as it is a fascinating topic and even though indeed nobody can understand it all, we can all benefit from tinkering with it, learning from it and possibly even using it.

It's also often showcased as impractical for a "normal" person with their "normal" computer. Consequently everything must be done in the "cloud", far away from our scrutiny.

This is also damaging as it is simply false. Some very large models are indeed too large to run on a single computer but most, including what was considered the state of the art just a couple of years ago, can be run locally. In fact the trend to scale might be problematic for the entire field.

Regardless of all those limitations the goal here is to showcase that even though not everything can be done on your desktop, a lot can. Composing from that and learning how it works can help to reconsider a potential feeling of helplessness.

Not only can you self-host AI models, use them, adapt them, but there is a whole community and set of tools to help you do so. This movement itself is very encouraging. AI does not have to be a black box. Your digital life does not have to be owned by someone else, even for the state of the art.

PS: this is also aligned with my own naive heuristic, made explicit in 2020: avoiding gadgets or services (free or not) that increase inequality by design, through technology or business model or both, would be a good starting point.

See also AgainstPoorArtificialIntelligencePractices for a less technical piece with 5 simple recommendations.

Prompted by recent (July 2024) news on Microsoft and Google completely busting their own energy goals due to AI.

Requirements

Software

Familiarity with self-hosting, e.g. the Linux command line (see Shell), and containerization, e.g. Docker. Ideally also familiarity with Python, as the most popular work in AI is currently often done in that programming language.

Hardware

A desktop with a proper graphics card is recommended. Some solutions do not need it at all while others need the latest generation of GPU. It is also possible to rent such a configuration in the cloud if necessary, while ensuring that the cloud provider has terms of service and overall practices aligned with your needs.

Ideal

Linux desktop with a latest generation NVIDIA GPU, Docker installed and running with NVIDIA support. Note that this can be done rootlessly to ensure a bit more safety. Overall do remember to back up your data regardless of what you are trying.

Notes on integration

Using a Telegram bot I am able to query models from any device while my desktop is turned on, e.g. here llama.cpp running Mistral https://twitter.com/utopiah/status/1720122249938628951 , and consequently I am considering a local-first alternative (no 3rd party relaying messages) via https://git.benetou.fr/utopiah/offline-octopus/issues/22

This way I can for example generate text from my mobile phone, being on the same network or not.

Chaining example of speech-to-text, LLM then text-to-speech for a "natural" (albeit slow) kind of "hands-free conversation" https://twitter.com/utopiah/status/1720475902218317930

This could also be done behind a VPN or relying on Tailscale without using Telegram or any chat program. For now using a chat program makes sharing on a mobile phone a lot more convenient.

On the cost of training

I had several heated discussions on social networks (cf e.g https://lemmy.world/post/30563785/17396851 ) sparked recently by someone who didn't know the training cost (fair, we can't always know how everything we played with was made) but more importantly, did not care.

That was shocking to me. There are countless articles explaining that "AI", however you define it, whatever you believe its impact might be, has a very tangible ecological impact. It takes a lot of connected computers with very energy-consuming parts (GPUs, TPUs, etc.) and a lot of water to cool them down.

What is even more important IMHO is that training is precisely the part of the entire process (research, programming, training, fine-tuning, inference, etc.) that consumes the most energy. Assuming it does take about an order of magnitude more than the other steps, saying one does not "care" about it means they do not care about their environment.

I have a hard time believing that is possible so I assume that what they mean is that training is basically insignificant because it's offset by positive impact.

In order to better estimate that, one could try to factor in

how much energy it took to train it (sometimes on model cards, but not always)
how many people use it (potentially data from aggregated public downloads as upper bound)
how long they use it (potentially until the replacement model comes, as upper bound also)

The hope here would be to have a few basic rules to facilitate choosing between equivalent models, or also potentially deciding not to train models that will not be used enough to offset their costs.
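One rough way to combine the three factors above is to amortise the training energy over the expected usage. This is a minimal sketch, not an established metric: the function name, its parameters and the numbers are all hypothetical and chosen only for illustration.

```python
def training_energy_per_user_hour(training_kwh, users, hours_per_user):
    """Amortise training energy over total expected usage.

    training_kwh: energy used for training, in kWh (e.g. from a model card)
    users: upper bound on the number of users (e.g. aggregated public downloads)
    hours_per_user: upper bound on usage per user, in hours
    """
    total_hours = users * hours_per_user
    if total_hours <= 0:
        raise ValueError("expected usage must be positive")
    return training_kwh / total_hours


# Hypothetical numbers, for illustration only.
print(training_energy_per_user_hour(training_kwh=50_000, users=10_000, hours_per_user=20))
```

Since downloads and lifetime are both upper bounds on usage, the result is a lower bound on the training energy attributable to each hour of use.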

Finally, the heuristic would be :

do not use a model without knowing its training cost
compare equivalent models in terms of performance, if alternatives exist, prefer the one that used the least resources for training
if you must train a model, estimate its energy cost and potential usage before training it
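The first two rules can be sketched as a small selection function. This assumes the candidates are already judged equivalent in performance and that training cost is expressed as a single number; the model names and figures below are made up.

```python
def pick_model(candidates):
    """Apply the heuristic to models assumed equivalent in performance.

    candidates: list of dicts with 'name' and 'training_kwh' (None if unknown).
    Returns the model with the lowest known training cost, or None.
    """
    # Rule 1: do not use a model without knowing its training cost.
    known = [m for m in candidates if m.get("training_kwh") is not None]
    if not known:
        return None
    # Rule 2: prefer the one that used the least resources for training.
    return min(known, key=lambda m: m["training_kwh"])


# Hypothetical candidates, names and numbers are invented.
candidates = [
    {"name": "model-a", "training_kwh": 120_000},
    {"name": "model-b", "training_kwh": None},
    {"name": "model-c", "training_kwh": 45_000},
]
print(pick_model(candidates)["name"])
```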

Note that the argument is for future models and current models. Even though we can not get the energy back for trained models, selecting the most efficient one, energy-wise, might help set a trend for the training of future models.

There are already quite a few initiatives for that with e.g. coffee with Fair Trade Certification or ISO 14001, in electronics Fair Materials, etc.

The point being that there are already mechanisms for feedback in other fields and in ML there are already model cards with a co2_eq_emissions field, so why couldn't feedback also work in this field?

Used locally

NLP (NLTK)
face tracking (OpenCV)
- PostureMaintenance
OCR (Tesseract)
- cf own KDE shortcut using it in combination with Spectacle, cf https://lemmy.ml/comment/8124650
- see also Eink
..? (TensorFlowJS)
VR and AI workshops with Yannel
- done over Glitch
face tracking (Human)
live deep fake (avatarify)
green screen (OBS plugin)
swarm robotics simulation (Argos)
- remote simulations with caching as Observable notebook
STT (vosk, deepspeech, long before that PocketSphinx)
image to 3D avatar (PIFuHD)
video understanding (PySlowFast)
translation (mozilla, OpenNMT via LibreTranslate)
edge computer vision (OAK-D)
segmentation (DIS)
tf-idf (gensim)
in browser NLP (nlp_compromise)
text-to-image (min(DALL·E), Stable Diffusion)
summarization (bart-large-cnn)
alpaca.cpp (chat based on LLaMA)
- also on a standalone VR headset via Termux
bloomz.cpp (completion relying on BLOOM)
SantaCoder (code generation built from The Stack)
CoquiTTS (to test HuggingFace Spaces deploy to Docker)
- reddit post with limitations
- use tts_models/fr/mai/tacotron2-DDC for French voice generation for an XR pedagogical small game
  - result https://fabien.benetou.fr/pub/home/future_of_text_demo/content/voicesBigguJulia/
  - 1-liner, with volumes for output and to preserve models (faster synthesis time): docker run --rm -v ~/tts-output:/root/tts-output -v ~/.tts-models/:/root/.local/share/tts/ ghcr.io/coqui-ai/tts-cpu --text "Allez, on recommence." --out_path /root/tts-output/hello.wav --model_name tts_models/fr/mai/tacotron2-DDC
turbocopilot (code generation relying on llama.cpp)
- integrated back in Vim (based on previous santacoder in Vim)
whisper.cpp (STT to generate caption, e.g https://twitter.com/utopiah/status/1649332186233950208 )
GPT4All-J via Nomic AI chat UI Installer
Shap-E
- thanks to https://ngwaifoong92.medium.com/introduction-to-shap-e-text-to-3d-a4fb5304642b
- converting to obj then using `obj2gltf` to get the model for WebXR
  - could rely on trimesh to convert to glTF, see https://trimsh.org/trimesh.exchange.gltf.html after https://trimsh.org/trimesh.exchange.load.html
- could provide a Web API using Flask for in VR generation
  - generating 4 latent objects in .ply takes 1min on a 2080ti, 1 takes 30s, 10 takes 2min
Immich
- for its https://github.com/immich-app/immich/tree/main/machine-learning
- see also https://www.digikam.org/news/2023-12-03-8.2.0_release_announcement/
Mistral, via mistral-7b-v0.1.Q5_K_S.gguf for llama.cpp
- integrated with XR https://twitter.com/utopiah/status/1713914092552089675 via llama.cpp server
GraphHopper for large scale routing, via Docker image and OpenAPI endpoint
- integrated with XR https://twitter.com/utopiah/status/1717176146981470249 via ngrok and jxr.js
Latent Consistency Models https://huggingface.co/SimianLuo/LCM_Dreamshaper_v7
- 6s on 2080ti https://twitter.com/utopiah/status/1719513868882342082
code generation using Phind-CodeLlama-34B-v2 using its GGUF Q4 for llama.cpp
- a minute or so for a 10 lines JS snippet https://twitter.com/utopiah/status/1719763524983877744
BGE_M3 embedding for "Multi-Functionality, Multi-Linguality, and Multi-Granularity"
- via https://towardsdatascience.com/openai-vs-open-source-multilingual-embedding-models-e5ccb7c90f05
- https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3#generate-embedding-for-text
- results from this very wiki as ~2Mb embeddings (pickle) querying in ~.1s https://twitter.com/utopiah/status/1766838354438471970
- cf ~/Prototypes/FlagEmbedding/wikidata.py
- consider also https://www.meilisearch.com/docs/learn/experimental/vector_search#generate-vector-embeddings
StarCoder2-3B
- tried locally float16 https://huggingface.co/bigcode/starcoder2-3b#running-the-model-on-cpugpumulti-gpu
Orca Mini v3 7B
- curious about Orca Math https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k but not yet available
- orca_mini_v3_7b.Q5_K_M.gguf from https://huggingface.co/TheBloke/orca_mini_v3_7B-GGUF via llama.cpp
TripoSR, 3D reconstruction from a single image
- https://github.com/VAST-AI-Research/TripoSR/
CroissantLLM, specifically for French/English capabilities for its limited size
- https://huggingface.co/blog/manu/croissant-llm-blog
tlm "Local CLI Copilot, powered by CodeLLaMa." https://github.com/yusufcanb/tlm
- related to own idea on embeddings of man pages, cf ~/Prototypes/find-combinable-brick/gensim-tfidf.py (May 2023)
- relying on CodeLLaMa
- own test : tlm suggest a command to get my ip as query returned uname -a
  - which is not correct, explanation also wasn't correct
OpenFunctions/Gorilla
- cf https://gorilla.cs.berkeley.edu/blogs/4_open_functions.html
  - via RAFT
- relying on gorilla-openfunctions-v1-q2_K.gguf
LocalAI
- cf https://github.com/mudler/LocalAI/ but unfortunately numerous problems
  - missing container latest tag, making documentation examples broken
  - AIO image did download a bunch of models yet didn't work (grpc issue)
  - seems there is only APIs without even an index.html page clarifying how to use it
Llama 3
- tried 8B (not 70B) via https://ollama.com
  - cf https://twitter.com/utopiah/status/1781560078883061825
  - also tried the REST API https://github.com/ollama/ollama?tab=readme-ov-file#rest-api
    - very convenient, doc in https://github.com/ollama/ollama/blob/main/docs/api.md
      - e.g "stream": false
    - see the JS client https://github.com/ollama/ollama-js
Vosk-browser for STT on Oculus
- https://git.benetou.fr/utopiah/text-code-xr-engine/src/branch/stt
- using https://github.com/ccoreilly/vosk-browser/tree/master/examples/modern-vanilla
wllama
- WASM with interface, high level and low level, able to load GGUF straight from HuggingFace
- https://github.com/ngxson/wllama?tab=readme-ov-file#simple-usage-with-es6-module
  - embeddings demo https://github.ngxson.com/wllama/examples/embeddings/
  - esm https://www.npmjs.com/package/@wllama/wllama?activeTab=code
- WebXR demo https://x.com/utopiah/status/1803420715732861427
Gemini nano on Chrome Canary as window.ai object
code generation with Replete-Coder-Llama3-8B
- using GGUF from HuggingFace and run using llama-cli -cnv
sumy for summarization
- via Docker https://github.com/miso-belica/sumy yet able to get directly a text from a URL
OpenHands (formerly OpenDevin) https://www.all-hands.dev
- didn't manage to make it run with Ollama locally though
deepseek-coder-v2
- tested via ollama on an AFrame component example, same usual mistake i.e being about Web (mouse) events but not about XR events
Stable Diffusion 3 Medium (using sd3_medium_incl_clips_t5xxlfp8.safetensors )
- https://github.com/comfyanonymous/ComfyUI
  - see also https://github.com/Comfy-Org/comfy-cli
Crawl4AI
- tested on fabien.benetou.fr
  - relying on the DockerHub image
  - didn't setup any LLM
  - relies on litellm https://github.com/BerriAI/litellm so can use OpenAI, Grok, Ollama, etc
GOT-OCR2.0 : General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
- https://github.com/Ucas-HaoranWei/GOT-OCR2.0/
- tests https://x.com/utopiah/status/1834588492854092081
- more than 5Gb of setup (from venv dependencies, model itself being 1.4Gb)
- doesn't seem radically better/worse than tesseract
  - only tried on couple of examples though, didn't try fine-grained nor multi-crop OCR
moshi
- unfortunately OOM https://github.com/kyutai-labs/moshi/tree/main/moshi
- downloads a 15Gb safetensors model, just a bit too big for a 2080ti (12Gb of VRAM)
- did try weeks ago during the announcement via their online demo
DepthAI with Oak-D Lite
- hardware https://docs.luxonis.com/hardware/products/OAK-D%20Lite
  - 4 TOPS of processing power (1.4 TOPS for AI) using RVC2 https://docs.luxonis.com/hardware/platform/rvc/rvc2/
- depthai_hand_tracker https://github.com/geaxgx/depthai_hand_tracker/
- DepthAI Viewer https://docs.luxonis.com/software/tools/dai-viewer/
surya OCR https://github.com/VikParuchuri/surya
BirdNet/Raven
- birdnetlib https://joeweiss.github.io/birdnetlib/#using-birdnet-analyzer
- testing against https://www.allaboutbirds.org/guide/House_Finch/sounds
- https://github.com/kahst/BirdNET-Analyzer
- found via https://www.birdweather.com/birdnetpi
- used before on phone Merlin Bird ID, also from Cornell
moondream as a "tiny" vision model
- tested remotely as https://huggingface.co/spaces/vikhyatk/moondream2
- tested locally using transformers via https://github.com/vikhyat/moondream#usage
  - 18s on 2080ti
  - requires 5.5Gb of dependencies, 3.7Gb model
  - cloning and run via Gradio did not work, neither via CUDA nor --cpu
llama3.2-vision
- very slow, e.g. 3min per image
- https://ollama.com/blog/llama3.2-vision
  - trouble passing base64 images
    - demo does work though, but smaller one, no cURL problem
omni-vlm-version: vlm-81-instruct
- via nexa (alternative to ollama, etc)
  - few setup bugs
    - https://github.com/NexaAI/nexa-sdk/issues/248
    - https://github.com/NexaAI/nexa-sdk/issues/249
- didn't seem to work
  - https://x.com/utopiah/status/1858440861744238764
deepseek-r1:1.5b
- smallest model available via ollama today, not bad at all for 1.1Gb
- few basic tests (basic arithmetic, programming in Javascript, argumentation) : one of the "least worst" smaller model I tried.
OAK-D Lite on RPi as computer vision on the edge
- e.g. ~/Prototypes/oakd-lite/bin/python3 ~/Prototypes/oakd-lite/depthai-python/examples/ColorCamera/autoexposure_roi.py
- details https://bsky.app/profile/benetou.fr/post/3lqjyvpm2ek2j
- on training itself
  - https://github.com/luxonis/depthai-ml-training/tree/main/training#-training-tutorials
  - conversion https://docs.luxonis.com/software/ai-inference/conversion
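Several entries in the list above rely on the Ollama REST API with "stream": false. A minimal sketch of such a call using only the standard library follows; it assumes Ollama is running on its default local port and that a model tagged llama3 has already been pulled, and the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint


def build_generate_request(prompt, model="llama3", url=OLLAMA_URL):
    """Build a non-streaming POST request for the /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="llama3"):
    """Send the prompt and return the generated text (needs a running Ollama)."""
    with urllib.request.urlopen(build_generate_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]


# Example, to run only once Ollama is up and the model is available:
# print(generate("Why is the sky blue?"))
```

With "stream": false the whole answer comes back as a single JSON object, which keeps the client trivial at the cost of waiting for the full generation.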

Trying to insure combinatoriality/compositionality

Short scripts in ~/bin to make the different tools and their result to combine more conveniently.

stt: using Whisper.cpp to convert an audio file to text
screenocr: capture part of the screen then OCR the result then Web search
- demo https://twitter.com/utopiah/status/1453019635465572352
ocr-to-wikiembeddings: capture part of the screen then semantic search it
monitor-voice-via-whispercpp-stream: using Whisper.cpp interactively
get-pages-from-embeddings: returns top10 wiki pages after embedding
santacoder: complete a code prompt via HuggingFace API and clean output
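The idea behind get-pages-from-embeddings can be sketched as a cosine similarity ranking. This is not the actual script: it uses tiny made-up vectors and pure Python, whereas real embeddings (e.g. from BGE_M3) would be much larger and usually handled with a numerical library.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors of the same length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_pages(query_vec, page_vecs, k=10):
    """Return the names of the k pages whose embeddings are closest to the query.

    page_vecs: dict mapping page name to its embedding (list of floats)
    """
    scored = sorted(page_vecs.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [name for name, _ in scored[:k]]


# Toy 2-dimensional embeddings, for illustration only.
pages = {"Shell": [1.0, 0.0], "Docker": [0.8, 0.6], "Eink": [0.0, 1.0]}
print(top_pages([1.0, 0.1], pages, k=2))
```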

Relying on

url-to-text: uses readability to get text content from a URL
yt-dlp to get online videos with optionally FFmpeg to get audio for transcription
- note that yt-dlp can also directly download subtitles, including automatically generated subtitle (automatic captions)
  - as demonstrated in older FoT demo to search through entire set of videos https://twitter.com/utopiah/status/1714262670860263550
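Combining such tools mostly means feeding the output of one into the next. A minimal sketch of that chaining from Python follows, assuming each tool reads standard input and writes standard output (an assumption about the scripts, not a description of them); printf and tr are used as stand-ins so it runs on any POSIX system.

```python
import subprocess


def chain(commands, text=""):
    """Run each command in turn, feeding the previous output as standard input."""
    for cmd in commands:
        result = subprocess.run(cmd, input=text, capture_output=True, text=True, check=True)
        text = result.stdout
    return text


# Stand-in commands; the ~/bin scripts would take their place.
print(chain([["printf", "hello wiki"], ["tr", "a-z", "A-Z"]]))
```

Using check=True makes the chain stop at the first failing step instead of silently passing empty text along.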

Used non locally but still open

SantaCoder in SpaSca (XR) https://twitter.com/utopiah/status/1621760472461594624
article summarization using bart-large-cnn https://twitter.com/utopiah/status/1569917194598862849

example as tweets

Remarks

supposes text input, including data as URI
model card to CO2 equivalent
- https://twitter.com/utopiah/status/1562398263495593984
beyond "just" hosting https://hacks.mozilla.org/2023/11/mozilla-ai-guide-launch-with-summarization-code-example/
fundamental problems according to research
- No "Zero-Shot" Without Exponential Data https://arxiv.org/abs/2404.04125
- ChatGPT is bullshit https://link.springer.com/article/10.1007/s10676-024-09775-5

Typical process

discover news about AI
live demo availability, if so try with own data
open-source repository
models availability
image availability, as Docker or replicate (cog)
- otherwise make Dockerfile
  - very often relying on Anaconda3
try locally
if out of memory (e.g OOM error with Torch) look for slimmed down version
if no Web interface available, use Python or NodeJS to make a basic interface
build demo using the Web API, integrating with other tools, e.g WebXR

Tooling

generic
- Docker
- notebooks
  - Jupyter
  - Observable
    - https://observablehq.com/d/b92fecbba00dbe31
dedicated
- cog
- Gradio
  - see https://observablehq.com/@utopiah/hf-gradio-discovery
- HuggingFace Transformers with its pipeline
- NLTK with its .download() to directly get models and datasets
- torch
- tensorflow
- jax

Python Flask template

Often models and inference come from Python code. Using Flask provides an HTTP (or even HTTPS) endpoint that makes it easy to integrate with a frontend, e.g. a WebXR page.

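# consider a venv then ./bin/pip3 install flask (uuid is part of the standard library)
# generate_image() and search() are placeholders for any inference with models already downloaded.
import uuid
from flask import Flask

app = Flask(__name__)

def generate_image(prompt):
    # placeholder: should return a list of PIL images, e.g. from a text-to-image pipeline
    raise NotImplementedError

def search(query):
    # placeholder: should return a list of results, e.g. top 10 pages from embeddings
    raise NotImplementedError

@app.route('/toimage/<prompt>', methods=['GET'])
def to_image(prompt):
    res = generate_image(prompt)
    myuuid = uuid.uuid4().hex
    # Flask serves the ./static/ directory at /static/ by default
    res[0].save('./static/' + myuuid + '.jpg', 'JPEG')
    return {'prompt': prompt, 'url': '/static/' + myuuid + '.jpg'}

@app.route('/totext/<query>', methods=['GET'])
def to_text(query):
    return {'prompt': query, 'top10': search(query)}

if __name__ == '__main__':
    print('can then expose as https via e.g ngrok http 5000')
    app.run()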

Note this could also use Gradio instead.

Constrained model pathfinder

starting locally then server then cloud
- cascading : [url1, url2, url3]
usage based, e.g. try to use generic terms from model cards and otherwise fall back to model names
- as done in https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.pipeline
Telegram bot as mobile interface
cloud-init Docker https://stackoverflow.com/questions/24418815/how-do-i-install-docker-using-cloud-init/62540068 that could point to a specific Dockerfile URL or to an existing image on official Hub image or nvcr.io, etc
can prioritize
- locality for privacy
- remote for CO2 equivalence
- a composite based on user weighting
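The cascading and prioritization ideas above can be sketched together: rank endpoints by a user-weighted composite score, then take the first one that responds. Everything here is hypothetical (URLs, criteria names, scores, the is_up probe); it only illustrates the shape of such a pathfinder.

```python
def first_available(endpoints, is_up):
    """Cascade through endpoint URLs in order, returning the first that responds."""
    for url in endpoints:
        if is_up(url):
            return url
    return None


def rank(endpoints, weights):
    """Order endpoints by a weighted composite score, best first.

    endpoints: list of dicts with 'url' plus one score in [0, 1] per criterion
    weights: dict mapping criterion name to the user's weight
    """
    def score(endpoint):
        return sum(w * endpoint.get(criterion, 0.0) for criterion, w in weights.items())
    return sorted(endpoints, key=score, reverse=True)


# Invented endpoints and scores: local is best for privacy, cloud for CO2 equivalent.
endpoints = [
    {"url": "http://localhost:8080", "privacy": 1.0, "co2": 0.3},
    {"url": "https://server.example", "privacy": 0.6, "co2": 0.5},
    {"url": "https://cloud.example", "privacy": 0.2, "co2": 0.9},
]
ordered = [e["url"] for e in rank(endpoints, {"privacy": 0.7, "co2": 0.3})]
# Pretend the desktop is turned off, so the cascade falls through to the server.
print(first_available(ordered, is_up=lambda url: "localhost" not in url))
```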

Adversarial protections

poisoning artwork against unauthorized training
- https://nightshade.cs.uchicago.edu
- https://glaze.cs.uchicago.edu/
blocking access
- https://darkvisitors.com
- https://github.com/ai-robots-txt/ai.robots.txt
https://kudurru.ai/ "Actively block AI scrapers from your website with Spawning's defense network"
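For the "blocking access" approach, a robots.txt can be generated and checked with the Python standard library. This is a minimal sketch: the two user agents are only examples, the ai.robots.txt project linked above maintains a much longer list, and robots.txt is advisory, so a scraper can still ignore it.

```python
from urllib.robotparser import RobotFileParser

AI_AGENTS = ["GPTBot", "CCBot"]  # example crawler user agents, not an exhaustive list


def robots_txt(agents):
    """Generate a robots.txt that disallows the whole site for each listed agent."""
    return "\n\n".join(f"User-agent: {agent}\nDisallow: /" for agent in agents) + "\n"


def is_blocked(robots, agent, url="https://example.org/"):
    """Check whether the given robots.txt content disallows this agent for the URL."""
    parser = RobotFileParser()
    parser.parse(robots.splitlines())
    return not parser.can_fetch(agent, url)


content = robots_txt(AI_AGENTS)
print(content)
print(is_blocked(content, "GPTBot"), is_blocked(content, "Mozilla/5.0"))
```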

Training followed

university training
- AI** at UTC
  - https://webapplis.utc.fr/uvs/
MOOCs
- AIClass
- MLClass
- Coursera
  - Sequences, Time Series and Prediction
  - Natural Language Processing in TensorFlow
  - Convolutional Neural Networks in TensorFlow
  - Introduction to TensorFlow for Artificial Intelligence, Machine Learning, and Deep Learning
  - Computational Molecular Evolution
- Dataquest.io
  - Data Cleaning Course
  - Data Science Projects Course
  - Data Visualization Course

To try

on toolformers
- https://github.com/cognitiveailab/neurosymbolic
- https://github.com/minosvasilias/toolformer-zero
nanoGPT https://github.com/karpathy/nanoGPT
minGPT https://github.com/karpathy/minGPT/
on mobile, e.g Docker on the PinePhone Pro
- camera is not well supported
- ARM architecture
- e.g OCR on camera output
- could start with transformers with a specific model to limit dependencies
https://www.oreilly.com/library/view/designing-machine-learning/9781098107956/
https://www.oreilly.com/library/view/machine-learning-design/9781098115777/
ts_server https://bellard.org/ts_server/
https://swethatanamala.substack.com/p/how-i-ran-llms-on-steam-deck-handheld
with end user interfaces
- https://lmstudio.ai
  - also with API https://lmstudio.ai/docs/local-server
    - including JS https://github.com/lmstudio-ai/lmstudio.js
- https://pinokio.computer
red teaming
- basic introduction from Microsoft https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/red-teaming
- HF blog post from 2023 https://huggingface.co/blog/red-teaming
- Recent survey on GenAI from 2024 https://arxiv.org/abs/2404.00629
Exo : "Run your own AI cluster at home with everyday devices" https://github.com/exo-explore/exo
LLM Explorer https://llm.extractum.io
- with categories like "LLMs Fit 16GB VRAM"
dev env
- https://github.com/zed-industries/zed
- https://github.com/voideditor/void
