
/λ/ - programming

structure and interpretation of computer programs.

File: 1561302657488-0.jpeg (78.86 KB, 1520x1080, lain-code-1.jpeg)


Everything C++: What you like/dislike about it, what you'd like to improve. Of course, also post your C++ projects.

I really like C++ because it is powerful, yet allows very intricate control over what the machine does. I think that efficiency and adaptability are the most important features of code, aside from cleanliness. What I don't like about it is that its syntax cannot be parsed by a pushdown automaton (so the grammar is not context-free).

, a coroutine library I have written in C++.
It achieves true M:N threading, and I spent a lot of time on the atomics to try to make the scheduling as efficient as possible.
I benchmarked it, and its context switching and scheduling are faster than Rust's coroutines (>1.5x) and Golang's goroutines (>20x @ 12 cores), so I am quite proud of it. For efficiency reasons, coroutines do not have their own stacks, so scoped variables and native exceptions are not supported. But the syntax ended up being nice enough to not be a nuisance, I think (see the docs/

Let's all love Lain!


Nice one OP.

I love C++ for its metaprogramming, its integration with C libraries, and the ability to emit assembly and SIMD intrinsics, which can get me up to 30x speedups at times, depending on my task.

I'm really looking forward to Ice Lake for its popularization of AVX-512.


I'm learning C :)


I don't like the way you have to metaprogram in C++ though. I think it'd be much nicer to have imperative-style metaprogramming instead of functional. I think that the preprocessor should be replaced with a more powerful language that is better suited for advanced code generation. If I had such a language, I could implement stuff such as coroutines much better. I also think that if you replaced the preprocessor with a proper metaprogramming language, compilation times would actually decrease, as every file would only have to be compiled once. Currently, every file has to be processed every time it is included, because the preprocessor makes its meaning context-dependent: you can always change the behaviour of a file by redefining some identifier before including it. Maybe I should pick up my custom language compiler project again^^
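To make the functional vs. imperative contrast concrete, here is a minimal sketch that computes the same compile-time value both ways. The names `Factorial` and `factorial` are made up for illustration. The imperative version relies on C++14 `constexpr` functions, which only cover compile-time computation, not the general code generation the post asks for.

```cpp
#include <cstdint>

// Functional style: the "loop" is recursion through template instantiation.
template <unsigned N>
struct Factorial {
    static constexpr std::uint64_t value = N * Factorial<N - 1>::value;
};

// Base case that stops the recursion.
template <>
struct Factorial<0> {
    static constexpr std::uint64_t value = 1;
};

// Imperative style: an ordinary loop that the compiler may evaluate
// at compile time (loops in constexpr functions need C++14 or later).
constexpr std::uint64_t factorial(unsigned n) {
    std::uint64_t result = 1;
    for (unsigned i = 2; i <= n; ++i) {
        result *= i;
    }
    return result;
}

// Both styles are evaluated by the compiler and must agree.
static_assert(Factorial<10>::value == factorial(10), "both styles agree");
```

Both forms produce a compile-time constant, but the second one reads like normal code, which is roughly the ergonomic gap being complained about.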
>C and assembly
Yeah, it's a really nice thing to be able to tweak and tweak until you have the maximum performance. I think it's also my obligation to make my program consume as little energy as needed.
>I'm learning C :)
That's cool, I wish you the best of luck with that. Learning C gives you a good understanding of how computers work (of course, you can always go deeper ;)). However, I recommend learning C++ once you have the basics of C down. C++ is more complicated, yes, but once you learn it, it saves you a lot of work (primarily through templates and constructors/destructors). Its type system is superior, and you can avoid a lot of duplication by using templates. Constructors and destructors will help you prevent memory leaks. And, of course, you have many common container types at your disposal already through the STL.
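A minimal sketch of the two features named above, assuming a hypothetical `Buffer` class template: the template is written once for any element type, and the destructor frees the memory on every path out of the scope. The global counter `g_live_buffers` exists only to make the cleanup observable.

```cpp
#include <cstddef>

// Hypothetical counter, only here to make construction/destruction visible.
int g_live_buffers = 0;

// One template serves every element type; no per-type duplication.
template <typename T>
class Buffer {
public:
    explicit Buffer(std::size_t count)
        : data_(new T[count]()), size_(count) {
        ++g_live_buffers;
    }

    // The destructor releases the memory automatically at scope exit.
    ~Buffer() {
        delete[] data_;
        --g_live_buffers;
    }

    // Copying is disabled so two objects never own the same allocation.
    Buffer(const Buffer&) = delete;
    Buffer& operator=(const Buffer&) = delete;

    T& operator[](std::size_t i) { return data_[i]; }
    std::size_t size() const { return size_; }

private:
    T* data_;
    std::size_t size_;
};

// Sums 1..n using a temporary buffer; there is no explicit free anywhere.
int sum_first_n(int n) {
    Buffer<int> numbers(static_cast<std::size_t>(n));
    for (std::size_t i = 0; i < numbers.size(); ++i) {
        numbers[i] = static_cast<int>(i) + 1;
    }
    int total = 0;
    for (std::size_t i = 0; i < numbers.size(); ++i) {
        total += numbers[i];
    }
    return total;  // numbers is destroyed here and its memory is released
}
```

In real code `std::vector<T>` or `std::unique_ptr<T[]>` already does this; the hand-written class only shows the mechanism.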


File: 1563749139967.png (1.79 MB, 1440x1080, vlcsnap-2019-07-10-03h09m4….png)

>I think it's also my obligation to make my program consume as little energy as needed.

Amen, that's also how I try to work. Nowadays, it seems like people just don't give a soykaf if their algorithms run in O(n²). "I have a quad core, what's the point?"

Computers don't have infinite memory nor are they infinitely powerful.
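As an illustration of the kind of O(n²) code being complained about, here is a sketch of duplicate detection written two ways. Both function names are made up for this example, and the hashed version assumes that spending extra memory on a set is acceptable.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// O(n^2): compares every pair of elements.
bool has_duplicate_quadratic(const std::vector<int>& values) {
    for (std::size_t i = 0; i < values.size(); ++i) {
        for (std::size_t j = i + 1; j < values.size(); ++j) {
            if (values[i] == values[j]) {
                return true;
            }
        }
    }
    return false;
}

// O(n) on average: remembers what was already seen in a hash set.
bool has_duplicate_hashed(const std::vector<int>& values) {
    std::unordered_set<int> seen;
    for (int v : values) {
        // insert() reports through .second whether the value was new.
        if (!seen.insert(v).second) {
            return true;
        }
    }
    return false;
}
```

For a handful of elements the quadratic loop is fine (and often faster); the difference only matters once n grows, which is exactly where a quad core stops saving you.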


File: 1563759152715.png (1.01 MB, 1224x1580, tumblr_per2yeiQ4G1sfpssdo1….png)

I got to a point in my programming where, once all the bigger pieces of my code base stay still, I start to optimize to make things as efficient as possible using every trick in the book: stuff like SIMD intrinsics, vectorization, using memory-mapped IO when I know a file is going to be read from randomly a lot, etc.

I'm at a point where I have a philosophy that every algorithm's "end-game" is to be implemented in hardware. Similar to how the inverse-square root code ultimately became implemented in hardware.
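For reference, the software trick alluded to is presumably the well-known fast inverse square root bit hack. A minimal sketch follows, assuming 32-bit IEEE 754 floats; the function name `fast_rsqrt` is made up here. On x86 the hardware counterpart would be an instruction such as SSE's RSQRTSS, which returns an approximate reciprocal square root directly.

```cpp
#include <cstdint>
#include <cstring>

// Approximates 1/sqrt(x) for positive, finite x.
// Assumes float is a 32-bit IEEE 754 value.
float fast_rsqrt(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);  // reinterpret the float's bit pattern

    // Magic constant minus half the bit pattern gives a rough first guess.
    bits = 0x5f3759dfu - (bits >> 1);

    float y;
    std::memcpy(&y, &bits, sizeof y);

    // One Newton-Raphson step refines the guess.
    return y * (1.5f - 0.5f * x * y * y);
}
```

With a single refinement step the result is only accurate to a fraction of a percent, which is part of why a dedicated instruction ended up replacing it.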

I'm pretty excited for consumer AVX-512 coming in Ice Lake, for all that it does to the x86 code base. Not for the register width, but for the fact that it gives a uniform "refresh" to the entire SSE stack.


check this soykaf out
Some beautifully hand-crafted SIMD code mixed in with some clever type-aware metaprogramming in C++ can provide some really huge speedups and efficiencies that can't be ignored while also being pretty expressive at a high level.


As Scott Meyers put it, C++ is a federation of languages/different idioms. There is no single C++, although the "Modern C++" idiom is the new "pythonic" of the C++ domain. I don't like Modern C++, because at the point where I have a runtime library dependency I am just going to use Rust or Go. I like C++ because it allows me to use the C/C++ idiom where it is especially useful, in systems programming. Finally, I think backwards compatibility ruined C++, like many things; C++11 should have been an entirely new language, because the language feels so bloated right now.


How can a language feel "bloated" when none of its features are forced on you? You can still program like it's C++98 if you want to. You only have to reach for the C++ features that you care to use. No one is forcing you to use unique_ptr or std::transform. Meanwhile Rust literally FORCES you to program within what is basically a subset of C++, where 37 people in San Francisco get to decide how you program and what paradigms are available.


How do you measure energy consumed? Or do you just go by "less cycles = less energy"?


Generally, memory accesses (especially ones that miss the cache and go out to DRAM) cost more energy than the cycles spent on arithmetic.


File: 1563855935258.png (362.37 KB, 640x830, e0bc421234e170af.png)

sick lib bro


What about the SIMD engine? Does its power consumption depend on usage? I have no idea how these things work but you made me curious.


SIMD is an overall efficiency gain: it gets your algorithm closer to a near-FPGA implementation, instead of a longer but functionally equivalent sequence of instructions.

If a single specialized SIMD or bit-manipulation instruction (or whatever) lets you complete your task in fewer uops and clocks, then yeah, you generally use a lot less power.

If you use AVX-512 to process 64 bytes of data in parallel, then you bet it's going to be a lot more efficient than picking up each byte and processing it serially.

If you use a single BSWAP instruction to swap the endianness of a 32-bit integer, then you bet it's gonna be a lot more efficient than doing all the bit shifts and bitwise ORs by hand. The same goes for a 64-bit swap, first done with shifts and ORs, then with BSWAP:


Swap64(unsigned long):
        push    rbp
        mov     rbp, rsp
        mov     QWORD PTR [rbp-8], rdi
        mov     rax, QWORD PTR [rbp-8]
        sal     rax, 56
        mov     rdx, rax
        mov     rax, QWORD PTR [rbp-8]
        sal     rax, 40
        mov     rcx, rax
        movabs  rax, 71776119061217280
        and     rax, rcx
        or      rdx, rax
        mov     rax, QWORD PTR [rbp-8]
        sal     rax, 24
        mov     rcx, rax
        movabs  rax, 280375465082880
        and     rax, rcx
        or      rdx, rax
        mov     rax, QWORD PTR [rbp-8]
        sal     rax, 8
        mov     rcx, rax
        movabs  rax, 1095216660480
        and     rax, rcx
        or      rdx, rax
        mov     rax, QWORD PTR [rbp-8]
        shr     rax, 8
        and     eax, 4278190080
        or      rdx, rax
        mov     rax, QWORD PTR [rbp-8]
        shr     rax, 24
        and     eax, 16711680
        or      rdx, rax
        mov     rax, QWORD PTR [rbp-8]
        shr     rax, 40
        and     eax, 65280
        or      rdx, rax
        mov     rax, QWORD PTR [rbp-8]
        shr     rax, 56
        or      rax, rdx
        pop     rbp
        ret


Swap64(unsigned long):
        mov     rax, rdi
        bswap   rax
        ret
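The two listings above were posted without their source. The following is a sketch of C++ that would plausibly produce them; the masks are taken from the constants in the first listing, but the function names `Swap64Manual` and `Swap64Builtin` are assumptions. `__builtin_bswap64` is a GCC/Clang builtin, so this is not portable to every compiler (C++23 offers `std::byteswap` in `<bit>` as a standard alternative).

```cpp
#include <cstdint>

// Byte swap written out with shifts, masks and ORs, one byte at a time.
std::uint64_t Swap64Manual(std::uint64_t x) {
    return (x << 56)
         | ((x << 40) & 0x00FF000000000000ULL)
         | ((x << 24) & 0x0000FF0000000000ULL)
         | ((x <<  8) & 0x000000FF00000000ULL)
         | ((x >>  8) & 0x00000000FF000000ULL)
         | ((x >> 24) & 0x0000000000FF0000ULL)
         | ((x >> 40) & 0x000000000000FF00ULL)
         | (x >> 56);
}

// Same operation through the compiler builtin (GCC/Clang only),
// which maps to a single BSWAP instruction on x86-64.
std::uint64_t Swap64Builtin(std::uint64_t x) {
    return __builtin_bswap64(x);
}
```

Both functions return the same value for every input; only the instruction count differs when the manual version is compiled without optimization, as in the first listing.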


for this code

#include <cstdint>
#include <cstddef>

void Add5(uint32_t Array[], size_t Length)
{
    for(size_t i = 0; i < Length; ++i)
        Array[i] += 5;
}

here is a naive compiler output

Add5(unsigned int*, unsigned long):
        test    rsi, rsi
        je      .L1
        lea     rax, [rdi+rsi*4]
.L3:
        add     DWORD PTR [rdi], 5
        add     rdi, 4
        cmp     rdi, rax
        jne     .L3
.L1:
        ret

here it is vectorized with `-march=skylake`. It uses 256-bit AVX2 (ymm) registers to process 8 values at a time

Add5(unsigned int*, unsigned long):
        test    rsi, rsi
        je      .L12
        lea     rax, [rsi-1]
        cmp     rax, 6
        jbe     .L6
        mov     rdx, rsi
        shr     rdx, 3
        sal     rdx, 5
        vmovdqa ymm1, YMMWORD PTR .LC0[rip]
        mov     rax, rdi
        add     rdx, rdi
.L4:
        vpaddd  ymm0, ymm1, YMMWORD PTR [rax]
        add     rax, 32
        vmovdqu YMMWORD PTR [rax-32], ymm0
        cmp     rax, rdx
        jne     .L4
        mov     rax, rsi
        and     rax, -8
        test    sil, 7
        je      .L14
        vzeroupper
.L3:
        lea     rdx, [rax+1]
        add     DWORD PTR [rdi+rax*4], 5
        cmp     rsi, rdx
        jbe     .L12
        add     DWORD PTR [rdi+rdx*4], 5
        lea     rdx, [rax+2]
        cmp     rsi, rdx
        jbe     .L12
        add     DWORD PTR [rdi+rdx*4], 5
        lea     rdx, [rax+3]
        cmp     rsi, rdx
        jbe     .L12
        add     DWORD PTR [rdi+rdx*4], 5
        lea     rdx, [rax+4]
        cmp     rsi, rdx
        jbe     .L12
        add     DWORD PTR [rdi+rdx*4], 5
        lea     rdx, [rax+5]
        cmp     rsi, rdx
        jbe     .L12
        add     rax, 6
        add     DWORD PTR [rdi+rdx*4], 5
        cmp     rsi, rax
        jbe     .L12
        add     DWORD PTR [rdi+rax*4], 5
.L12:
        ret
.L14:
        vzeroupper
        ret
.L6:
        xor     eax, eax
        jmp     .L3
.LC0:
        .long   5
        .long   5
        .long   5
        .long   5
        .long   5
        .long   5
        .long   5
        .long   5

While it looks a lot longer, it is actually MUCH MUCH faster for large arrays, because each vector instruction processes 8 values at once, so it ultimately executes far fewer instructions than if you were to process the values individually.
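The shape of the vectorized listing (a main loop over blocks of 8, then a scalar tail for the leftovers) can be sketched in plain C++. This is not what the compiler was given; it only mirrors the control flow of its output. `Add5Chunked` and `kLanes` are names invented for this sketch, and the inner loop stands in for the single `vpaddd` instruction.

```cpp
#include <cstddef>
#include <cstdint>

void Add5Chunked(std::uint32_t* array, std::size_t length) {
    // 8 lanes of 32 bits fill one 256-bit ymm register.
    constexpr std::size_t kLanes = 8;

    // Largest multiple of kLanes that fits (the "and rax, -8" in the listing).
    const std::size_t vector_end = length - (length % kLanes);

    std::size_t i = 0;

    // Main loop: each iteration corresponds to one vpaddd over 8 values.
    for (; i < vector_end; i += kLanes) {
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            array[i + lane] += 5;
        }
    }

    // Scalar tail: at most 7 leftover elements, handled one by one
    // (the unrolled "add DWORD PTR ..., 5" sequence in the listing).
    for (; i < length; ++i) {
        array[i] += 5;
    }
}
```

For 19 elements this runs two block iterations and three tail iterations, instead of 19 iterations of the naive loop.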


OP here. This thread turned out really nice!
>sick lib bro
Thanks, really appreciate it.
>How do you measure energy consumed? Or do you just go by "less cycles = less energy"?
I don't measure it in general, I'm not autistic enough for that. However, if an algorithm runs faster, uses less memory, etc., then it usually runs as well on cheaper hardware as a bad algorithm does on expensive hardware. If it runs fine on a Raspberry Pi, then I don't have to use a laptop or tower for it. Depending on the impact of your software, this can make a huge difference. For projects that I am serious about, I always try to keep the demands on the hardware as low as possible. Although this mindset leads to me almost never finishing anything nontrivial, I can bear with it, because it feels right to me (crazy, I know).
I've been working (sporadically, for years now) on a language that basically has all the good features of C++ while having a clean syntax (type 2 grammar), so it becomes easy to parse. I hate how, in C++, you cannot correctly highlight a program's syntax using a simple grammar, because you need a full-blown parser to recognise which name is a type and which name is a value. This semantic ambiguity is something that I don't like.
A * b;
Did I just declare a pointer b to type A, or did I multiply the variables A and b?
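A small sketch of the ambiguity: the same token sequence `A * b;` compiles in both scopes below and means two different things, depending only on what `A` was declared as earlier. The namespace and function names are made up for the example.

```cpp
namespace a_is_type {

struct A {};

bool demo() {
    A * b;        // declaration: b is a pointer to A
    b = nullptr;  // assign before reading, the pointer starts uninitialized
    return b == nullptr;
}

}  // namespace a_is_type

namespace a_is_value {

int demo() {
    int A = 6;
    int b = 7;
    A * b;        // expression statement: multiplies, result is discarded
    return A * b;
}

}  // namespace a_is_value
```

A syntax highlighter that only has a grammar, and no symbol table, cannot tell these two statements apart, which is the complaint above. Compilers will usually warn about the discarded multiplication.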
Cleaning up the syntax aside, I'll probably also integrate coroutines, but in a way that allows the user to control how and where the stack frame is allocated. This is the one thing that I like the least about the C++ coroutine TS. You can even control which allocator your vectors use, but you can't control anything about coroutines, which is a shame. And the design seems to enforce that you allocate stacks for every coroutine, which is pretty resource-consuming once you ramp up the concurrency.
I will probably also add lazy evaluation as a native feature, so that people can finally do stuff like implementing early-out (short-circuit evaluation) in their overloaded logical operators such as && and ||, which is currently impossible in C++.
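Assuming the operator meant here is a logical one such as `&&`, the limitation can be shown in a few lines: the built-in `&&` skips its right operand when the left one is false, while an overloaded `operator&&` is an ordinary function call whose arguments are both evaluated first. `Flag`, `make_flag`, `make_bool` and the counter are hypothetical names for this sketch.

```cpp
// Hypothetical wrapper type with its own operator&&.
struct Flag {
    bool value;
};

// Counts how many operands actually got evaluated.
int g_evaluations = 0;

bool make_bool(bool v) {
    ++g_evaluations;
    return v;
}

Flag make_flag(bool v) {
    ++g_evaluations;
    return Flag{v};
}

// Overloaded operator: both arguments are evaluated before this runs.
Flag operator&&(Flag lhs, Flag rhs) {
    return Flag{lhs.value && rhs.value};
}

// Built-in &&: the right operand is skipped because the left is false.
int evaluations_builtin() {
    g_evaluations = 0;
    bool result = make_bool(false) && make_bool(true);
    (void)result;
    return g_evaluations;
}

// Overloaded &&: no short-circuit, both operands are evaluated.
int evaluations_overloaded() {
    g_evaluations = 0;
    Flag result = make_flag(false) && make_flag(true);
    (void)result;
    return g_evaluations;
}
```

Lazy evaluation of arguments, as proposed in the post, would let the overloaded version skip the second call as well.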
The phenomenon of people just throwing hardware at problems is one of the reasons why I am very hesitant to upgrade my computer. But I'll have to go from 4GiB RAM to 8GiB, because even having Sublime Text, Firefox, and VLC player open is enough to bring my PC to a crawl after a few hours. It seems like Mozilla bloated their browser so much over time that it needs more RAM with every update. A few years ago, I never had any problems with the amount of RAM I have, except maybe when playing video games.
Meanwhile, people buy 64GiB RAM for their laptops…
That's pretty cool. I never got deep enough into all the SIMD stuff, regrettably.
>every algorithm's "end-game" is to be implemented in hardware.
I am trying to keep the language designed such that it is easy to program for custom hardware: You can declare a
, which basically is just a byte array and a set of functions on it (which might be inline assembly instructions). Thus, you could implement native 32-byte floating point types with ease, if the hardware supports it. This feature is inspired by Bluespec, but I am unsure whether I should go even further, so that you can actually design hardware using my language. I'm kind of torn about it because it will most probably clash with its usability as a software programming language.


Ice Lake just came out.
Looks great. Can't wait to play with the new instructions


This might be a stupid question, but isn't vectorization the same thing as SIMD?


Vectorization is the general high-level concept of data parallelism: being able to process large chunks of data in parallel.
SIMD is one hardware implementation of vectorization.


I want to learn C++ with all the shiny additions from the recent standards, what would you recommend to study? I knew some basic C++ years ago, know C fairly well and used other OOP languages.


Why do you want to know it?

If you want to know it to use for something, then
practice using it for that thing,
follow best practices, and
you will learn everything you need to know.
