Commit Graph

  • a872a2b28e
    ggml-alloc : fix discrepancy between measure & eval (#2639) Shouzheng Liu 2023-08-17 03:35:53 -04:00
  • 0919a0f73d
    cmake : install ggml-meta.metal if LLAMA_METAL (#2449) Kolen Cheung 2023-08-16 21:09:49 +01:00
  • ed53db86c3
    metal : print error of load pipeline state (#2564) Jhen-Jie Hong 2023-08-17 04:09:03 +08:00
  • fc8ef549e5
    metal : enable ggml-alloc (#2627) Shouzheng Liu 2023-08-16 16:08:28 -04:00
  • bf83bff674
    metal : matrix-matrix multiplication kernel (#2615) Shouzheng Liu 2023-08-16 16:07:04 -04:00
  • b5ffb2849d
    scripts : add helper script to get wikitext Georgi Gerganov 2023-08-15 10:04:58 +03:00
  • 3ebb00935f
    server : add missing /json-schema-to-grammar.mjs (#2616) Jhen-Jie Hong 2023-08-15 06:14:14 +08:00
  • d783f7982e
    metal : return null instead of exit(1) (#2573) Jhen-Jie Hong 2023-08-14 21:37:39 +08:00
  • d75561df20
    server : add --numa support (#2524) Cheng Shao 2023-08-14 15:36:42 +02:00
  • 348acf188c
    llama : add missing enum keyword in function signatures (#2610) Kamil Tomšík 2023-08-14 15:35:16 +02:00
  • 1cd06fa25e
    CUDA: launch_bounds, small q4_K, q5_K mmq refactor (#2596) Johannes Gäßler 2023-08-14 10:41:22 +02:00
  • 2feb8934eb
    server : fix default grammar by using an empty string in the UI (#2604) Jhen-Jie Hong 2023-08-14 16:20:17 +08:00
  • 5517d6e692
    server : implement json-schema-to-grammar.mjs & add grammar param in the UI (#2588) Jhen-Jie Hong 2023-08-14 15:16:54 +08:00
  • f31b539714
    Enhance compatibility with Windows 7 and below. (#2592) vxiiduu 2023-08-14 13:59:16 +10:00
  • ee77efea2a
    test : add simple grammar parsing tests (#2594) drbh 2023-08-13 10:00:48 -04:00
  • f64d44a9b9
    CUDA: Fixed OpenLLaMA 3b mmq, reduced compile time (#2590) Johannes Gäßler 2023-08-13 00:24:45 +02:00
  • b19edd54d5
    Add support for llama2.c models (#2559) byte-6174 2023-08-11 19:17:25 -04:00
  • 53dc399472
    server: fixed wrong variable name in timing json (#2579) Equim 2023-08-12 06:35:14 +08:00
  • 9ca4abed89
    Handle ENABLE_VIRTUAL_TERMINAL_PROCESSING more gracefully on earlier versions of Windows. DannyDaemonic 2023-08-10 13:11:36 -07:00
  • e59fcb2bc1
    Add --n-predict -2 for stopping generation on full context (#2565) Christian Demsar 2023-08-10 10:28:27 -04:00
  • 1638757767
    Fix grammar-based sampling issue in server (#2566) Martin Krasser 2023-08-10 12:16:38 +02:00
  • 916a9acdd0
    ggml-alloc: Don't try to re-use buffers of external tensors (#2562) Sam Spilsbury 2023-08-09 23:47:42 +03:00
  • ea04a4ca19
    add log_callback to llama_context_params for custom logging. (#2234) grahameth 2023-08-09 22:46:40 +02:00
  • 25d43e0eb5
    CUDA: tuned mul_mat_q kernels (#2546) Johannes Gäßler 2023-08-09 09:42:34 +02:00
  • f5bfea0580
    Allow passing grammar to completion endpoint (#2532) Martin Krasser 2023-08-08 15:29:19 +02:00
  • acfc5478ff
    CUDA: tighter VRAM scratch size for 65b/70b (#2551) Johannes Gäßler 2023-08-08 14:38:16 +02:00
  • 7ed8d1fe7f
    llm.vim : multiline autocompletion, get rid of "^@" (#2543) chaihahaha 2023-08-08 20:07:02 +08:00
  • e7f94d6fdc
    vim : bring back simple llm.vim example Georgi Gerganov 2023-08-08 15:05:30 +03:00
  • 2d7baaf50f
    vim : streaming and more (#2495) AustinMroz 2023-08-08 06:44:48 -05:00
  • f3c3b4b167
    Add --rope-scale parameter (#2544) klosax 2023-08-07 19:07:19 +02:00
  • 93356bdb7a
    ggml : mul mat tweaks (#2372) Georgi Gerganov 2023-08-07 14:25:58 +03:00
  • 60baff7c85
    ggml : pad result of ggml_nbytes() Georgi Gerganov 2023-08-07 14:24:42 +03:00
  • 9082b5dfbf
    ggml : change params pointer (style change) (#2539) Georgi Gerganov 2023-08-07 13:55:18 +03:00
  • 99d29c0094
    ggml : sync (custom ops) (#2537) Georgi Gerganov 2023-08-07 13:20:09 +03:00
  • 3d9a551816
    Fixed mmap prefetch for GPU offloading (#2529) Johannes Gäßler 2023-08-07 10:09:40 +02:00
  • f6f9896ac3
    metal : fix out-of-bounds access + increase concurrency nodes (#2416) Georgi Gerganov 2023-08-07 10:52:57 +03:00
  • 34a14b28ff
    [Makefile] Move ARM CFLAGS before compilation (#2536) GiviMAD 2023-08-06 23:21:46 -07:00
  • 7297128db8
    [Zig] Rewrite build for Zig 0.11 (#2514) Henri Vasserman 2023-08-07 08:35:53 +03:00
  • 86c3219895
    console : fix issue related to Windows 11 PowerShell console mode persistence (#2521) DannyDaemonic 2023-08-05 23:49:34 -07:00
  • 2e8265ae17
    convert.py : add missing abstract methods for quantized data (#2491) Keiichi Tabata 2023-08-06 15:34:05 +09:00
  • f514d1b306
    CUDA: faster k-quant mul_mat_q kernels (#2525) Johannes Gäßler 2023-08-05 18:20:44 +02:00
  • 332311234a
    fix Firefox autoscroll (#2519) Jonas Wunderlich 2023-08-04 20:16:11 +00:00
  • 182af739c4
    server: regenerate completion.js.hpp (#2515) Cebtenzzre 2023-08-04 15:00:57 -04:00
  • 4329d1acb0
    CUDA: use min compute capability of GPUs actually used (#2506) Cebtenzzre 2023-08-04 11:35:22 -04:00
  • 02f9d96a86
    CUDA: check if event is NULL before cudaStreamWaitEvent (#2505) Cebtenzzre 2023-08-04 11:34:32 -04:00
  • 3498588e0f
    Add --simple-io option for subprocesses and break out console.h and cpp (#1558) DannyDaemonic 2023-08-04 08:20:12 -07:00
  • 5f631c2679
    Fix race condition in server and partial stream handling in frontend. (#2391) Stephen Nichols 2023-08-04 06:37:24 -05:00
  • 415e99fec2
    Stream llama context data to file when saving instead of allocating the entire buffer upfront (#2488) l3utterfly 2023-08-04 19:29:52 +08:00
  • ff966e7ca6
    build : fix several cast and printf warnings (#2499) Borislav Stanimirov 2023-08-04 13:07:21 +03:00
  • 8183159cf3
    examples : generate JSON according to schema (#1887) Evan Jones 2023-08-02 22:05:44 -04:00
  • 468ea24fb4
    CUDA: faster non k-quant mul_mat_q kernels (#2483) Johannes Gäßler 2023-08-02 18:04:04 +02:00
  • 4f6b60c776
    CUDA: Fix models with output size != 32000 (#2480) Johannes Gäßler 2023-08-02 16:48:10 +02:00
  • 220d931864
    readme : add Aquila-7B model series to supported models (#2487) ldwang 2023-08-02 16:21:11 +08:00
  • 81844fbcfd
    tests : Fix compilation warnings (Linux/GCC) (#2451) Eve 2023-08-02 04:06:19 -04:00
  • a312193e18
    readme : Add Chinese LLaMA-2 / Alpaca-2 to supported models (#2475) Yiming Cui 2023-08-02 14:18:31 +08:00
  • c574bddb36
    fix a typo in examples/server/README.md (#2478) Bono Lv 2023-08-01 20:54:28 +08:00
  • 86aeb27734
    server : Support dark mode (#2414) ebraminio 2023-08-01 01:56:23 -07:00
  • 1873ff586b
    metal : add gqa8 kernel to allow llama-2-70B on metal (#2459) Matteo Boschini 2023-08-01 09:43:12 +02:00
  • 49e7cb5bb1
    CUDA: fixed LLAMA_FAST compilation option (#2473) Johannes Gäßler 2023-07-31 21:02:19 +02:00
  • b772bba42e
    CUDA: fixed cmake F16 option (#2471) Johannes Gäßler 2023-07-31 19:52:22 +02:00
  • 0728c5a8b9
    CUDA: mmq CLI option, fixed mmq build issues (#2453) Johannes Gäßler 2023-07-31 15:44:35 +02:00
  • 1215ed7d5c
    CUDA: Implemented row flattening for non-glm RoPE (#2468) Johannes Gäßler 2023-07-31 14:32:30 +02:00
  • 2dbf518911
    CUDA: fewer memory bank conflicts for mul_mat_q (#2458) Johannes Gäßler 2023-07-31 13:18:51 +02:00
  • 9d2382b3e4
    Fix Metal backend broken from the allocator changes (#2455) slaren 2023-07-31 11:02:53 +02:00
  • a113689571
    ggml : add graph tensor allocator (#2411) slaren 2023-07-30 15:58:01 +02:00
  • 11f3ca06b8
    CUDA: Quantized matrix matrix multiplication (#2160) Johannes Gäßler 2023-07-29 23:04:44 +02:00
  • 9baf9ef304
    CUDA: faster multi GPU synchronization (#2448) Johannes Gäßler 2023-07-29 23:04:10 +02:00
  • 8a88e5855c
    perplexity : add Hellaswag calculation (#2389) klosax 2023-07-28 20:25:36 +02:00
  • a9559bf77b
    ggml : workaround for missing _mm256_setr_m128i in GCC < 8 in k_quants.c (#2405) Lee 2023-07-29 02:17:45 +08:00
  • ee1b497c98
    llama : support more diverse tokenizers? (#2420) eric8607242 2023-07-29 02:10:05 +08:00
  • d73b8d48b4
    examples : fix whitespace Georgi Gerganov 2023-07-28 21:05:08 +03:00
  • 34ae1caf7f
    examples : server chat mode with llama2 (#2400) nhamanasu 2023-07-29 03:02:10 +09:00
  • d91f3f0c55
    readme : fix the description of the Tail free sampling (TFS) method (#2431) Weird Constructor 2023-07-28 10:44:43 +02:00
  • 65cdf34bdc
    llama : use n_embd_gqa instead of n_embd to handle llama-2 70B (#2433) Rand Xie 2023-07-28 01:42:53 -07:00
  • edcc7ae7d2
    Add instructions for obtaining LLaMA 2 (#2308) niansa/tuxifan 2023-07-28 03:14:11 +02:00
  • 7c529cede6
    convert.py : Update to support 70B HF format model files (#2427) mj-shifu 2023-07-27 22:39:17 +02:00
  • 1a941869cb
    metal : disable graph concurrency optimization due to bug (#2413) Georgi Gerganov 2023-07-27 11:00:54 +03:00
  • b5472ea0ad
    ggml : fix assert in ggml_set_unary_op (#2410) slaren 2023-07-26 23:57:23 +02:00
  • 6df1f5940f
    make : build with -Wmissing-prototypes (#2394) Cebtenzzre 2023-07-26 14:00:04 -04:00
  • 5488fb789e
    ggml : allocate graphs in a context (#2392) slaren 2023-07-26 15:56:53 +02:00
  • eb542d3932
    Add LLAMA_DEFAULT_RMS_EPS so we can change the default (#2384) Kawrakow 2023-07-25 18:35:53 +03:00
  • 07aaa0f63f
    ggml : fix ggml_flash_attn to use op_params (#2387) slaren 2023-07-25 16:20:12 +02:00
  • fce48caf9a
    convert.py : support bpe tokenizer (#2228) ldwang 2023-07-25 21:22:09 +08:00
  • 875086bdb9
    ggml : relax contiguous constraints in activation function (#2371) Jiahao Li 2023-07-25 20:58:32 +08:00
  • da1889834a
    ggml : improve graph build time via hash table lookup (#2329) slaren 2023-07-25 14:32:20 +02:00
  • 82552b7f54
    build : fix line breaking error in build-info.sh (#2349) Hesen Peng 2023-07-25 05:24:09 -07:00
  • 0c06204fb3
    main : add --in-prefix-bos to prefix BOS to user inputs; keep EOS (#2304) Xiao-Yong Jin 2023-07-25 07:19:11 -05:00
  • 1fed755b1f
    ci : add non-AVX scalar build/test (#2356) Eve 2023-07-25 08:16:13 -04:00
  • be2301bcda
    k_quants : add AVX support to dot functions with QK_K as 64 (#2339) katsu560 2023-07-25 21:13:41 +09:00
  • 1aa18ef994
    metal : concurrently dispatch commands (#2358) Shouzheng Liu 2023-07-25 08:00:19 -04:00
  • 9a08eaf3c4
    Another speed gain for Q4_0 and Q4_1 on Metal (#2375) Kawrakow 2023-07-25 13:48:29 +03:00
  • 129d844c87
    Fix Q4_K and Q5_K for QK_K = 64 on CUDA (#2359) Kawrakow 2023-07-25 13:48:04 +03:00
  • d5512b782b
    server: add rms_norm_eps parameter (#2380) slaren 2023-07-25 11:36:17 +02:00
  • c798308e3a
    [Server] Escape HTML in webchat (#2368) Henri Vasserman 2023-07-25 10:27:34 +03:00
  • 41c674161f
    make rms_norm_eps a parameter (#2374) slaren 2023-07-24 17:57:12 +02:00
  • b3f138d058
    Chat UI extras (#2366) Aarni Koskela 2023-07-24 17:54:22 +03:00
  • 5b2b2dc6ae
    ggml : sync (unary ops refactor, static-correctness) (#2370) Georgi Gerganov 2023-07-24 14:46:21 +03:00
  • 42f70cb2f6
    Fix scalar version of Q5_K when QK_K = 64 (#2362) Kawrakow 2023-07-24 12:55:02 +03:00
  • 84e09a7d8b
    llama : add grammar-based sampling (#1773) Evan Jones 2023-07-23 23:58:10 -04:00
  • 2f9cf974a0
    Some more Q4_K and Q5_K speedup on CUDA (#2346) Kawrakow 2023-07-24 00:19:47 +03:00