Core S2 Software Solutions

A light discussion on programming languages

Programming languages are to programmers what saws are to carpenters. A language is one of the most important tools among the many we need, yet just as a carpenter needs many different saws, languages vary enormously across different applications. The analogy fits the variety of languages we have developed since the 1940s quite well. I've recently been thinking about how languages are translated to machine code nowadays, as well as other language concepts, and I learned quite a bit through my coreBasic project.

Sure, developing a new programming language is about as unnecessary as developing a new OS, but it is still a healthy academic exercise. There are even tools out there that let you define a programming language's syntax and have a generated interpreter do all the work, but I think it's much more fun to write all of that yourself! Let's walk through the several layers and ways in which a programming language can be executed, ranging from raw binary to compiled to interpreted languages. Naturally you could learn much more through the Wikipedia articles on each topic, but again, it's fun to casually walk through and explain each type.

Binary / Machine-Code

Binary, the number encoding scheme, truly is the lowest level of code. A number can represent code or data, but the fundamental reason we encode these numbers in binary is that modern processors are built on transistors and thus discrete logic gates. Sure, we could technically develop gates based on any arbitrary base, but that complicates the system, and the increased cost and complexity do not bring any sort of performance gain. That, and working with variable voltage to represent an arbitrary number has some nasty real-world conflicts: if the voltage varies too much (as it tends to do in real-world applications), your signal starts losing the original value and fails to carry its original meaning on to the next component. By sticking to binary, we know that a signal represents either 0 or 1 by being less than or more than a certain threshold (i.e. 2.5 volts in a 0 to 5 volt system). Simplicity leads to solid signals, and solid signals lead to consistent processing.

We now know why we use binary, but what does it mean? How in the world does the processor know that 0x87 (hex) is the opcode for subtraction (this was made up, don't think I'm looking up the op-codes for any specific processor)? This is a really neat question with a super cool answer, but it is far too complicated to cover fully in this article. Ask a friend in computer engineering or find a book on processor design, but it boils down to a cool concept: the binary representation of the opcode turns on and off certain parts of the processor. Since 0x87 (again, a made-up example) has its lowest two bits high, this might tell the processor it is a math operation. The next two higher bits are b01 (binary), which might indicate that out of the four basic math operations (addition, subtraction, multiplication, and division), this is the second one: subtraction. Though the ALU (the unit responsible for math operations) might compute all four of these results in parallel, we turn off the output of the three other results using some sort of gate, while the result of the subtraction is seen because that b01 signal tells us to look only at the subtraction result. If you load up the opcode list for MIPS, you can actually see a pattern among similar operations. This was not done because it would be "nice" to group similar opcodes, but because fundamentally some parts of the bit-level representation have to be closely associated!
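Here is a rough sketch in C of that idea, using a completely made-up encoding (just like the opcode above): the ALU computes all four results, and two bits of the opcode select which one is actually exposed.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical encoding (not a real instruction set): the low two bits
     * flag "math operation", the next two bits pick which of the four
     * results the ALU actually exposes. */
    uint32_t alu(uint8_t opcode, uint32_t a, uint32_t b)
    {
        uint32_t results[4] = { a + b, a - b, a * b, b ? a / b : 0 };
        uint8_t select = (opcode >> 2) & 0x3;  /* b00=add, b01=sub, b10=mul, b11=div */
        return results[select];                /* the other three are "gated off" */
    }

    int main(void)
    {
        /* Low bits b11 mark a math op, selector b01 picks subtraction. */
        printf("%u\n", alu(0x07, 9, 4));       /* prints 5 */
        return 0;
    }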

Other parts of the instruction might encode other information: depending on the architecture, some bits might represent the destination address of a jump instruction, the register we are to add to, or the address to which we write our data. Also note that instructions don't even have to have the same number of bits or bytes! x86, the architecture most modern PCs use, is a variable-length instruction platform. This means that some instructions may take only one byte to represent, while others can take up to a dozen!
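As a concrete example of fixed-width field encoding, here is a small C sketch that pulls apart a MIPS R-type instruction. The field layout is the real 32-bit MIPS R-type format; the decoder itself is just an illustration.

    #include <stdint.h>
    #include <stdio.h>

    /* Field layout of a 32-bit MIPS R-type (register) instruction. */
    typedef struct {
        uint8_t op;     /* bits 31-26: opcode (0 for R-type)      */
        uint8_t rs;     /* bits 25-21: first source register      */
        uint8_t rt;     /* bits 20-16: second source register     */
        uint8_t rd;     /* bits 15-11: destination register       */
        uint8_t shamt;  /* bits 10-6 : shift amount               */
        uint8_t funct;  /* bits 5-0  : selects add, sub, and, ... */
    } RType;

    RType decode_rtype(uint32_t word)
    {
        RType r;
        r.op    = (word >> 26) & 0x3F;
        r.rs    = (word >> 21) & 0x1F;
        r.rt    = (word >> 16) & 0x1F;
        r.rd    = (word >> 11) & 0x1F;
        r.shamt = (word >>  6) & 0x1F;
        r.funct =  word        & 0x3F;
        return r;
    }

    int main(void)
    {
        RType r = decode_rtype(0x012A4020);  /* add $t0, $t1, $t2 */
        printf("rs=%d rt=%d rd=%d funct=0x%X\n",
               r.rs, r.rt, r.rd, (unsigned)r.funct);
        return 0;
    }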

Many new software engineers think that this machine code maps directly to the processor, meaning the n-byte machine instruction doesn't get decomposed, and once it is in the instruction pipeline and being worked on, nothing outside of the ALU or other relevant component happens. It turns out that on many modern processors (more so for media-optimized systems like x86's MMX extension, not so much for smaller embedded systems due to price-versus-performance trade-offs) there is effectively another layer of decoding on board. A single increment operation is a bad example, but the idea still holds for other ops: some instructions, especially batched, media, or complex arithmetic operations, are handled by different parts of the processor and are sometimes re-interpreted into special optimized cases. Many software engineering students take a basic processor design course and learn that instructions are usually split into an op-code (operation code) and some arguments (or indexes of registers, or addresses of an argument list, etc.), and that the processor directly executes the given instructions. To speed things up, it turns out there are dedicated parts that re-interpret instructions off the main core (though it could be on the same core; that is an implementation detail), or some optimization rule is applied so that, say, a division turns into a faster bit-shift. This is called micro-code, and it has another nice feature: pseudo-portability! Binary code is normally not portable, which means that code you built for x86 will not run on MIPS processors. But what happens if your old x86 code is running on a new x86 processor that has taken your old op-codes and optimized them? The microcode will step in, find these old ops, and execute them using the newer (and faster) components! This way you, as the programmer, don't have to write anything new. You can release the same code, use the same compiler, and yet clients and end-users still get a performance boost!
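The division-to-shift rewrite mentioned above is the same kind of strength-reduction rule you can see at the C level; here is a quick sanity check, assuming unsigned values (my own illustration, not anything specific to a given processor's microcode).

    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
        for (uint32_t x = 0; x < 1000; ++x) {
            /* For unsigned values, x / 8 and x >> 3 are identical, and the
             * shift is the cheaper circuit; this is the kind of rewrite a
             * compiler or a lower decoding layer can apply behind your back. */
            assert(x / 8 == (x >> 3));
        }
        return 0;
    }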

Software engineers really need to understand that assembly / machine code isn't the last stop in terms of optimization: each new generation of processors isn't faster just because of a smaller die, but because of some amazing engineering at the component level!

Compiled Code

The next layer on top of machine code is compiled code. This is simply any formal language that is translated by a program into the target processor's (or VM's, for that matter) architecture. C, C++, and Fortran (and many, many others) are classic examples of this. This is also where high-level optimizations are done that would be very hard to express and apply at the assembly level; though this is debatable: look at how LLVM optimizes code even after it has been lowered to its assembly-like intermediate form.

From a practical point of view, these languages were critically important in the world of computer science: they opened a door to writing fast code (i.e. code compiled to binary form) in a portable, easy-to-maintain, and readable language. The aspect of portability is critical, as it is much easier to re-write one single program (the compiler) than to port all of your code from one architecture to another. This premise is exactly why C came about and was used as the language for the first portable operating system: UNIX.

What is also great about this level is the kind of optimization that can take place. As mentioned above with the LLVM example, there are ways to optimize the concept you attempted to implement. Assembly, at least at the human level, is very hard to maintain and understand, and thus trying to optimize a given person's code is very difficult without some sort of abstraction, like variable names rather than stack positions. C allowed developers to write code that was easy to review, improve upon, share, and "clean up". Optimization could finally be done at the conceptual level and from a higher scope: a developer could take a step back, realize that maybe the base loop wasn't efficient, and clean up that code without having to worry about manipulating jump addresses, memory read offsets, and so on.
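As a small illustration of what "optimizing at the conceptual level" looks like (my own example, nothing from coreBasic): spotting a wasted computation in a loop takes one glance at the C, while finding the same waste in a page of hand-written assembly full of jump offsets is much harder.

    #include <stdio.h>
    #include <string.h>
    #include <ctype.h>

    /* Before: strlen() is re-evaluated on every pass through the loop. */
    void upper_slow(char *s)
    {
        for (size_t i = 0; i < strlen(s); ++i)
            s[i] = (char)toupper((unsigned char)s[i]);
    }

    /* After: the length never changes inside the loop, so hoist it out. */
    void upper_fast(char *s)
    {
        size_t len = strlen(s);
        for (size_t i = 0; i < len; ++i)
            s[i] = (char)toupper((unsigned char)s[i]);
    }

    int main(void)
    {
        char text[] = "optimize the concept, not the opcodes";
        upper_fast(text);
        puts(text);
        return 0;
    }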

Virtual Machines

VMs are a simple concept: compile your code into an assembly-like language (called byte-code) that generally maps closely to instructions on most modern processors, so that the code is portable and fast to execute, and porting the VM itself is a relatively easy task.
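To make the idea concrete, here is a minimal sketch of a byte-code dispatch loop in C. The three-instruction stack machine below is entirely hypothetical, but real VMs have the same basic shape: fetch a byte, switch on it, repeat.

    #include <stdio.h>
    #include <stdint.h>

    /* A made-up byte-code: push a constant, add the top two values, print. */
    enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

    void run(const uint8_t *code)
    {
        int32_t stack[64];
        int sp = 0;            /* stack pointer */
        for (int pc = 0; ; ) { /* program counter walks the byte-code */
            switch (code[pc++]) {
            case OP_PUSH:  stack[sp++] = (int8_t)code[pc++];   break;
            case OP_ADD:   sp--; stack[sp - 1] += stack[sp];   break;
            case OP_PRINT: printf("%d\n", stack[sp - 1]);      break;
            case OP_HALT:  return;
            }
        }
    }

    int main(void)
    {
        /* Byte-code for "print 2 + 3" */
        const uint8_t program[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
        run(program);
        return 0;
    }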

Virtual machines are not nearly as young or as "bad" (read: slow) as people imagine them to be. Many novice programmers learn Java first, and some get frustrated with the runtime issues and (arguably) bloated syntax. Then, the moment they move onto a native (compiled) language like C++, they tend to want to go back to a managed environment because of how hard it is to manage their own memory. Why the love-hate-love relationship? Honestly, Java does make your life as a developer much easier: the memory is managed, the code executes on any platform the VM has been ported to, and there are tons of nice libraries built in, all non-existent in languages like C and C++ if we ignore the standard and 3rd-party libraries.

VMs tend to be managed and sandboxed environments. This means that if your code breaks, the process doesn't truly crash; the VM simply halts, and you can quickly learn quite a bit through the language's tools rather than having to read through binary-level stacks and function calls obscured by unhelpful compiled code. There is also an advantage for the user: if the code is malicious, it can't do any real harm, like reading another process's memory space, since the process is held within its own sandbox and fundamentally cannot reach out and hurt the user's system (though if the VM itself has a vulnerability, you could compromise the host system by abusing that issue, reaching from your code, through the VM, to the host). A helpful feature for developers is that managed environments, though this isn't part of the true definition of a VM, tend to have managed memory: programmers can allocate whatever they want, whenever they want, and the system, not the developer, is responsible for retaining or releasing memory over time. This completely avoids the very common mistake of heap-allocation mismanagement.

Another benefit is that modern VMs tend to do JIT, or Just-In-Time, compiling. This turns generated VM byte-code, at run time, into native instructions that can be executed again later on. As an example, imagine your program is running for the first time. Each time a certain number of byte-code instructions has been executed, a page of the generated native code is saved. When you execute your program a second time, or the flow of execution revisits something already executed, no byte-code needs to be read; that page of native code is simply executed. Many other languages that use either a VM or an interpreter now use JIT as a way to massively speed up an application, though there is a simple drawback: the high-level optimizations that compilers for languages like C and C++ perform cannot take place if the code is being compiled page by page.
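The core trick of a JIT is surprisingly small: write native instructions into executable memory, then jump to them. Here is a minimal sketch in C, assuming Linux on x86-64 (the six bytes below encode "mov eax, 42; ret"); a real JIT would of course generate these bytes from byte-code rather than hard-code them.

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* x86-64 machine code for: mov eax, 42 ; ret */
        unsigned char code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };

        /* Ask the OS for a page we are allowed to both write and execute. */
        void *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (page == MAP_FAILED) return 1;

        memcpy(page, code, sizeof(code));

        /* Call the freshly generated native code as if it were a C function. */
        int (*fn)(void) = (int (*)(void))page;
        printf("JIT-ed function returned %d\n", fn());

        munmap(page, 4096);
        return 0;
    }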

Interpreted Languages

Last, but not least, is the class of interpreted languages. This is the class whose benefits and drawbacks are easiest to understand. Some form of translation always has to happen to turn code into machine code. Even assembly has to be translated into exact machine code, since the computer doesn't understand the "NOP" mnemonic, but it does know what 0x90 (hex) means. Interpretation can be done quite quickly, as there are plenty of known algorithms that can verify syntax, build function stacks, call functions, and so on. The question is when you want to do this translation: at compile time (before a user even gets the finished executable), or at run time (at the user's expense).

Clearly, compiling the code beforehand will always generate faster code; if you interpret the language at run time, the processor has to run through dozens of instructions just to understand what one op you wrote means. Even the simplest example, an addition, takes many processor cycles in an interpreted language because it has to read the token from memory, then load the two arguments (addition is a binary operation, meaning you add the element on the left to the element on the right), and then store the result. If compiled, it would literally take one instruction: the processor's addition instruction.
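A toy C example of what "understanding what one op you wrote means" costs at run time; the eval function below is hypothetical and handles only a single binary expression, yet it still burns many instructions to perform one add.

    #include <stdio.h>

    /* A toy interpreter for one line of source such as "2 + 3".
     * Every run repeats all of these steps: scan the text, pick apart the
     * operands, branch on the operator. A compiler would boil the whole
     * thing down to a single ADD instruction. */
    int eval(const char *src)
    {
        int lhs, rhs;
        char op;
        if (sscanf(src, "%d %c %d", &lhs, &op, &rhs) != 3) return 0;
        switch (op) {
        case '+': return lhs + rhs;
        case '-': return lhs - rhs;
        case '*': return lhs * rhs;
        case '/': return rhs ? lhs / rhs : 0;
        }
        return 0;
    }

    int main(void)
    {
        printf("%d\n", eval("2 + 3"));  /* dozens of instructions for one add */
        return 0;
    }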

So why have interpreted languages to begin with? There are some real advantages: they are quick to change and edit, very safe since the interpreter knows exactly what the code is trying to do, there is no compile / link time (which is long in big projects), and an interpreter tends to be, in some ways, easier to implement than a compiler.

Conclusion

Through coreBasic, I've learned quite a bit about the implementation-level details of compiled and interpreted languages, as well as how a VM works. Though I have yet to implement a JIT, I see conceptually how it works, and I know enough assembly to probably get something working in short order. Again, all of this is a fun academic exercise, and with a market flooded with so many languages, it is flat-out impractical to create your own language in a commercial environment. Game programmers are now using Lua (interpreted) and JavaScript (interpreted / VM / JIT) for scripting, native languages are becoming less and less common outside of low-level programming, and managed code seems to have dominated the market for business applications for 10+ years now. Who knows what the next big paradigm shift in programming will be?
