Posted by : Unknown
Friday, July 26, 2013
Java Performance in the Real World
The growing excitement about Java in corporate IT departments has been closely
followed by a growing concern about its performance. While numerous trade
seminars, presentations and articles have explored Java performance, very few
have focused on real world IT systems. This article discusses the performance
of Java-based applications that solve real IT problems: enforcing business
rules, accessing disparate data, and presenting information graphically.
To determine Java's performance impact in solving each of the above problems, we
developed a variety of stand-alone and distributed object applications in both
Java and C++, and compared their throughput under various load conditions. We
also applied the latest performance-enhancing techniques available in both
languages, such as threads and Just-In-Time compilation, and measured their
impact. Our findings are both surprising and contrary to some widely held
notions about Java's performance.
Standalone applications
Since a real world IT system may consist of several standalone components, it is
instructive to begin by analyzing the main tasks of a standalone application.
According to Carmine Mangione in a Feb’98 JavaWorld article, these are:
1. Loading program executables, either from local storage (hard disk) or
remotely over the network.
2. Running program instructions, including math operations, method calls and
other business logic.
3. Allocating and freeing memory used by the application.
4. Accessing system resources such as I/O, file handling, and printing.
To the above, we would add the following tasks for a typical GUI-based IT
application:
5. Rendering and managing a Graphical User Interface (GUI).
6. Handling user events (mouse clicks, input, drag-and-drop, etc.).
The diagrams below illustrate how a Win32 and a Java application handle these
functions.
[Figure: Architecture of a standalone C++ application on NT]
[Figure: Architecture of a standalone Java application]
From the above diagrams it can be seen that, in addition to the 6 sets of
functions described for a C++ application, a Java application also:
7. Runs a byte-code verifier and Security Manager during program loading to
prevent illegal stack overflow and data conversion and to restrict access to
resources such as network sockets. This would seem to indicate that a Java
application would load more slowly than an equivalent C++ one, but in practice
there are two reasons why Java applications often load faster, especially
across a network:
· Java executables are significantly smaller than their C++ counterparts.
· The class files which make up a Java application can be loaded dynamically
as needed rather than being loaded all at once, as is often the case with C++
libraries.
8. Uses either a byte-code interpreter or a Just-In-Time (JIT) compiler to
translate byte-code to machine instructions before executing them. If each
program instruction is interpreted and then run, the application will run 3-10
times slower than a compiled C++ version. However, JIT compilers significantly
reduce the performance lag by compiling often-used instructions into machine
code on the fly. They also perform some primary and secondary optimizations on
the code, similar to, but not as extensive as, the optimization done by a good
C++ compiler. We would therefore expect program instructions to be executed
fastest by a C++ program, with a JIT-enabled Java VM close behind and a Java
interpreter performing much slower.
9. Performs garbage collection by identifying and releasing memory that is no
longer in use and moving memory around to prevent fragmentation. Garbage
collection can reduce one of the leading causes of bugs in IT systems: memory
leaks. However, because it entails using handles for object references and
requires the garbage collector to be constantly running in the background, it
incurs a performance penalty.
10. Uses peer interfaces for accessing system resources, rather than directly
calling the underlying operating system. Where these peer interfaces map
directly to the underlying subsystem (print, I/O, graphics), this technique
adds minimal overhead while maintaining a consistent interface to system
resources across platforms. However, where the peer interfaces don’t directly
use the underlying subsystem, execution time increases greatly.
Real-world results
Having looked at the reasoning behind potential differences in C++ and Java
application performance, we now detail the actual results of benchmarking both
kinds of applications while performing the tasks outlined above.
All the tests were carried out on a Pentium II 266 MHz machine with 128 MB RAM,
running NT 4.0 Workstation. C++ applications were developed using Microsoft
Visual C++ 5.0 and Java applications using the Java Development Kit (JDK)
1.1.6; Visual Basic 5.0 was used to develop the non-Java GUI. Three different
applications were used to isolate and time specific tasks:
1. A Program Execution App was developed in C++ and Java to measure execution
of program instructions, loops, and method calls.
2. A Memory Analyzer was developed in C++ and Java to measure memory allocation
and deallocation.
3. A GUI Plus application was developed in Visual Basic 5.0 (compiled
executable) and Java to measure program loading, graphics, and event handling.
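To make the first kind of test concrete, here is a minimal sketch of how an integer-division loop and a member-method loop might be timed in Java. It is not the original benchmark code: the class and method names are our own, and `System.nanoTime()` comes from later Java releases than JDK 1.1.6, so treat it as an illustration for a current JVM rather than a reproduction of the tests.

```java
/**
 * Minimal timing sketch in the spirit of the "Integer Division" and
 * "Member Method" tests. Names are illustrative, not the original code.
 */
class DivisionBenchmark {
    /** Member method wrapping a single integer division. */
    int divide(int numerator, int denominator) {
        return numerator / denominator;
    }

    /** Inline loop: sums 100 / i for i = 1..iterations. */
    static long inlineLoop(int iterations) {
        long sum = 0;
        for (int i = 1; i <= iterations; i++) {
            sum += 100 / i;
        }
        return sum;
    }

    /** Same work, but each division goes through a method call. */
    long methodLoop(int iterations) {
        long sum = 0;
        for (int i = 1; i <= iterations; i++) {
            sum += divide(100, i);
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 10_000_000; // loop count used in the article's tests

        long start = System.nanoTime();
        long inlineSum = inlineLoop(n);
        long inlineMs = (System.nanoTime() - start) / 1_000_000;

        start = System.nanoTime();
        long methodSum = new DivisionBenchmark().methodLoop(n);
        long methodMs = (System.nanoTime() - start) / 1_000_000;

        // Printing the sums keeps the loops from being optimized away.
        System.out.println("inline: " + inlineMs + " ms (sum " + inlineSum + ")");
        System.out.println("method: " + methodMs + " ms (sum " + methodSum + ")");
    }
}
```

Absolute timings from a sketch like this depend heavily on the VM and hardware, so only the relative difference between the two loops is meaningful.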
Test | Description | Time (s) C++ | Time (s) JIT | Time (s) Interpreter
Integer Division | This test loops 10 million times on an integer division. | 1.3 | 1.4 | 3.8
Member Method | This test loops 10 million times calling a member method, which contains an integer division. | 1.3 | 1.5 | 9
First Million Primes | This test calculates the first million prime numbers. It exercises variable access, array access, and function-call invocation. | 400 | 420 | 1800
Memory Allocation | This test allocates and frees 10 million 32-bit integers. | 0.7 | 1.6 | 1.6
Program Load | This test loads the VB and Java GUI using an executable to tabulate time. | 0.6 | 0.3 | 0.3
Render GUI | This test measures the time needed to render a complex screen with buttons, fields, list boxes, etc. | 0.02 | 0.3 | 0.3
Perform Events | This test performs 1 million button clicks on the GUI. | 0.5 | 0.5 | 2
Analysis
The following observations can be made from the test results:
· For the first three tests, the JIT-enabled Java application is only slightly
(0-15%) slower than the C++ version, while the interpreted code is 3-8 times
slower. The large performance penalty for interpreted code arises from repeated
interpretation of the same code as well as the lack of any optimization. The
JIT code is slightly slower because the compiler performs fewer code
optimizations and virtually no global optimization, and also because of Java’s
use of handles for object references.
· Memory allocation and deallocation is slightly more than twice as slow for
Java applications. This is because of the workings of the garbage collector
discussed earlier.
· Program loading is actually faster in Java, as anticipated. This is primarily
due to the difference in executable size.
· While the JIT-enabled Java GUI handles events just as fast as its VB
counterpart, it is 15 times slower in rendering the GUI. This large difference
can be attributed to the fact that while the VB application calls the Win32
graphics subsystem directly, the Java GUI uses the Java Foundation Classes
(JFC) framework, which has its own built-in graphics engine.
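The ratios quoted in these observations follow directly from the timings in the table. As a quick check, the sketch below recomputes a few of them; the class and method names are ours and purely illustrative.

```java
/** Recomputes slowdown ratios from the standalone results table. */
class SlowdownRatios {
    /** How many times slower a measured time is than the baseline. */
    static double ratio(double measuredSeconds, double baselineSeconds) {
        return measuredSeconds / baselineSeconds;
    }

    public static void main(String[] args) {
        // All times are in seconds and copied from the results table.
        System.out.printf("Integer division, JIT vs C++:      %.2f%n", ratio(1.4, 1.3));
        System.out.printf("Member method, interpreter vs C++: %.2f%n", ratio(9, 1.3));
        System.out.printf("Memory allocation, Java vs C++:    %.2f%n", ratio(1.6, 0.7));
        System.out.printf("Render GUI, Java vs VB:            %.2f%n", ratio(0.3, 0.02));
    }
}
```

The memory allocation ratio comes out a little above 2 and the GUI rendering ratio at 15, matching the "slightly more than twice as slow" and "15 times slower" statements.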
Distributed applications
[Figure: Architecture of a Distributed Application]
While there exist numerous real-world IT systems that are standalone, a majority
of business-critical applications developed in the last few years have been
distributed, either using a client-server architecture or, more recently,
distributed objects. Thus, a true measure of Java’s performance in real world
IT systems can only be gained by comparing the performance of
business-critical, distributed Java and C++ applications. The architecture of a
typical distributed object application is shown above. It consists of a GUI
client that communicates via an object protocol with a set of application
server objects that encapsulate the business logic. These server objects
provide persistence by interfacing with object and relational databases via
data access objects. They can also participate in transactions by using
transaction managers such as Tuxedo and Encina, and they frequently access
legacy data and applications on the mainframe.
In addition to performing all the tasks of a standalone application (program
loading and execution, memory management, accessing system resources and
graphics processing), distributed applications also:
1. Make distributed object requests between client and server objects, among
server objects, and between server objects and data access, transactional or
legacy objects. Depending on the object middleware used, these requests can be
in DCOM, CORBA IIOP or, in the case of Java applications, RMI.
2. Access disparate data (relational and non-relational), using a variety of
protocols including ODBC, OLE DB, Embedded SQL and, for Java applications, JDBC.
3. Interface with legacy systems using middleware such as CORBA or Microsoft
Transaction Server, or even the JDK running on the mainframe or AS/400.
4. Make use of threads. Multithreaded clients allow the GUI to quickly return
control to the end-user while processing a request in a separate thread.
Multithreaded servers can handle simultaneous requests from multiple clients
and can scale more easily.
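The multithreaded client idea in item 4 can be sketched as follows. This is an illustration under our own assumptions, not code from the test applications: `slowRemoteCall` is a hypothetical stand-in for a distributed object request, and `CompletableFuture` is a modern Java API that was not available in JDK 1.1.6, where a plain `Thread` would have been used instead.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

/** Sketch of a client that hands a slow request to a worker thread. */
class BackgroundRequest {
    /** Hypothetical stand-in for a slow remote object request. */
    static String slowRemoteCall(String query) {
        try {
            TimeUnit.MILLISECONDS.sleep(200); // simulate network and server time
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "result for " + query;
    }

    /** Starts the request on a worker thread and returns immediately. */
    static CompletableFuture<String> submit(String query) {
        return CompletableFuture.supplyAsync(() -> slowRemoteCall(query));
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<String> pending = submit("customer 42");
        // The calling (GUI) thread is free to handle user events here.
        System.out.println("request submitted, client still responsive");
        System.out.println(pending.get(5, TimeUnit.SECONDS));
    }
}
```

In a real GUI client the result would normally be delivered back to the event thread through a callback rather than by blocking on `get`, which is used here only to keep the example short.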
Real-world results
Having described the additional tasks required of distributed applications, we
now compare the performance of C++ and Java implementations of such
applications performing the above tasks.
As before, all the tests were carried out on a Pentium II 266 MHz machine with
128 MB RAM, running NT 4.0 Workstation. In this case two workstations were used
to distribute the application, with many clients and servers running on each.
C++ applications were developed using Microsoft Visual C++ 5.0 and Java
applications using the Java Development Kit (JDK) 1.1.6. VisiBroker 3.2 (C++
and Java) was used as the CORBA object middleware. For data access, the C++
application used Microsoft’s ODBC driver to access data in MS SQL Server 6.5,
while the Java application used a JDBC Type 3 driver from Intersolv to access
the same data. Three different components were used to isolate and time
specific tasks:
1. An Object Request component was developed in C++ and Java to measure
distributed object requests.
2. A Data Access component was developed in C++ and Java to measure data access
from SQL Server.
3. A MultiThread component was developed in C++ and Java to measure
synchronized thread calls.
Test | Description | Time (s) C++ | Time (s) JIT | Time (s) Interpreter
ORB init and bind | This test measures the time needed to initialize the CORBA client and bind to the remote application server. | 1 | 0.9 | 0.9
Single object invocation | This test instantiates a remote object which performs 1 million operations. | 0.02 | 0.03 | 0.7
Multiple object invocation | This test instantiates 3000 remote objects in a loop, each of which performs 1 operation. | 36 | 22 | 22
Database connection | This test connects to a remote SQL Server 6.5 database. | 0.3 | 1 | 0.7
Select | This test loops 100 times and retrieves 10 rows from the database. | 27 | 12 | 12
Synchronized Method | This test measures the time needed to access a synchronized method 20000 times. | 10 | 17 | 18
Analysis
The following observations can be made from the test results:
· A JIT compiler provides limited performance improvement for distributed
applications. This can be inferred from the fact that results for most tests
are almost identical between JIT-enabled and interpreted Java applications. In
general, the network hop is the gating factor for distributed object requests
while the database driver is the gating factor for data access. The latter
conclusion is drawn from the fact that the Java application accesses data more
than twice as fast as the C++ application, primarily because it uses a Type 3
JDBC driver with server-side SQL execution, which is much more efficient than
the client-side execution provided by the C++ ODBC driver.
· Remote object instantiation is faster with Java. This could be attributed to
a better implementation of the CORBA Basic Object Adapter in VisiBroker for
Java vs. VisiBroker for C++.
· As expected, synchronized methods are slower in Java than C++. This is
because such methods keep both a C and a Java stack in memory and also execute
a significant amount of additional code to provide thread-safety.
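For readers who want to see what a synchronized-method test of this kind might look like, here is a minimal sketch. The class is our own illustration, not the MultiThread component used in the tests, and it runs on a single thread, so it measures only uncontended locking. Results on a current JVM may differ considerably from the table above.

```java
/** Sketch comparing a synchronized method with a plain one. */
class SyncCounter {
    private long count;

    /** Thread-safe: acquires the object's monitor on every call. */
    synchronized void incrementSynchronized() {
        count++;
    }

    /** Not thread-safe: only suitable when one thread owns the object. */
    void incrementUnsynchronized() {
        count++;
    }

    long get() {
        return count;
    }

    public static void main(String[] args) {
        int calls = 20_000; // call count used in the article's test
        SyncCounter counter = new SyncCounter();

        long start = System.nanoTime();
        for (int i = 0; i < calls; i++) {
            counter.incrementSynchronized();
        }
        long syncNanos = System.nanoTime() - start;

        start = System.nanoTime();
        for (int i = 0; i < calls; i++) {
            counter.incrementUnsynchronized();
        }
        long plainNanos = System.nanoTime() - start;

        System.out.println("synchronized:   " + syncNanos + " ns");
        System.out.println("unsynchronized: " + plainNanos + " ns");
        System.out.println("total count:    " + counter.get());
    }
}
```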
Performance enhancing techniques
While distributed applications in general are not overly affected by Java’s
performance limitations, there are two important reasons for trying to enhance
their performance:
A. If a distributed application performs computationally intensive work, or its
GUI is fairly complex, then some of the limitations of the standalone JIT code,
as seen in the GUI and method-call tests, can become more pronounced. A hint of
this can be seen in the “single-object invocation” test above, where the C++
version is slightly faster than the Java JIT code because the remote operation
is performing some computational work.
B. While the relative performance of C++ and Java distributed applications may
be similar, there is definitely an advantage in increasing the absolute
performance of a Java distributed application, so that it provides higher
throughput, increased transactions/sec, and greater scalability.
Performance can be improved at several levels. At the lowest level, a faster
Virtual Machine and a better JIT compiler can produce better optimized and
faster executing code. A level above this are Java performance tools and
libraries, such as specialized libraries for I/O, as well as faster data access
drivers. Finally, some of the biggest performance gains can be obtained by
profiling the application code and then optimizing it using proven techniques.
Each of these methods is explored in greater detail below:
Using a faster Virtual Machine and JIT compiler
There are numerous Java VMs and JIT compilers available, especially on popular
platforms such as Win95, NT and Solaris. The speed of VMs is usually rated
using one of two popular benchmarks: Jmark 2.0 and CaffeineMark 3.0. Each runs
a variety of tests, including processor-intensive tasks, GUI and thread calls,
on a particular VM and combines the results into a composite Jmark or
CaffeineMark score which can be used by an evaluator (but more often by the
vendor’s marketing folks) to make (or push) a VM selection. Some of the faster
VMs and JIT compilers on the market today include:
1. Supercede 2.0 Pro compiler and VM with native code generation.
2. TowerJ compiler and VM with native code generation.
3. Microsoft VM 3.1 with generational garbage collector. This VM is available
as part of the Visual J++ product or as a free download from Microsoft’s Java
website at http://www.microsoft.com/java/. In tests performed by Sun engineers
at the JavaOne conference, the MS VM executed 20-45% faster than the JDK 1.2
beta3 and the JDK 1.1.6 VMs with JIT.
4. Kaffe, available free on 30 operating systems, includes JIT conversion from
byte-code to native code.
5. Symantec VM and JIT, available as part of Symantec Visual Café and also
bundled with Netscape Navigator and Sun’s JDK 1.1.
6. Inprise VM and JIT, available as part of JBuilder 2.0.
Using Java performance tools and libraries
Tools available for optimizing Java applications include:
1. The javac compiler itself with the -O option for optimization. Using this
compiler option provides some primary optimization and dead code elimination.
2. JAX from IBM can reduce the size of a Java application (up to 50% reduction
in size) and make it more efficient by removing dead code, inlining method
calls, etc. It is available for free download at
http://www.alphaworks.ibm.com/formula/JAX.
Java class libraries available for improving performance include:
1. The Windows Foundation Classes (WFC) and the J/Direct API from Microsoft,
which allow Java applications to call the Win32 subsystem directly and thus
greatly improve graphics handling and other system tasks. The tradeoff is
application portability, because the APIs of these libraries are used as an
alternative to standard Java AWT/JFC calls. These libraries are available with
Visual J++.
2. Perflib provides a set of Java classes for high performance sorting,
searching, I/O, etc. The routines claim to be up to 5 times as fast as standard
JDK implementations.
3. A variety of Type 3 and 4 JDBC drivers are available for fast, native access
to most relational databases. Some of the popular vendors include Inprise with
their Data Gateway and Microfocus/Intersolv’s DataDirect product suite.
Profiling and Optimizing Application code
While the above techniques can yield significant performance improvements,
tuning application code can potentially provide the greatest “bang for the
buck”. This is especially true if a major inefficiency can be identified and
eliminated in the 20% of code that is executed 80% of the time. A good way to
discover programming inefficiencies is to run the Java code through a profiler.
Profiling allows the detection of performance bottlenecks, identification of
CPU and memory intensive code, and collection of function-level and even
line-level timing data. Some of the Java profilers on the market include:
1. Visual Quantify from Rational Software.
2. OptimizeIt from Intuitive Systems.
3. JProbe from KLGroup.
4. Jinsight, a freeware profiler and memory analyzer from IBM.
Once the problem areas of an application are identified, there are a number of
steps that can be taken to improve overall performance. At the component level,
there are numerous coding techniques that can be used to increase code
efficiency and avoid problem APIs. The “Java performance tuning tips 1.0”
article from IBM and the “Java performance and optimization” article from
Inside Java (available in the Resources section) both discuss some of these
techniques in detail. For improving the performance of business-critical
distributed applications, the following techniques are available:
· Avoid synchronized methods in multithreaded applications, if possible. As
seen from the tests above, they are fairly slow and resource intensive. In some
cases, it might be better to create two versions of a method, one synchronized
and one non-synchronized, and only use the former when absolutely necessary.
· Pass objects by value when appropriate. Some middleware environments such as
CORBA make it especially convenient to pass objects by reference to a remote
module. The problem with this approach is that any time the remote module needs
access to the object, it needs to make a remote call back to the passing
object. So in cases where a passed object needs to be accessed frequently, it
makes more sense to take the initial hit of passing it by value.
· Use JDBC with precompiled SQL, rather than dynamic SQL, for oft-repeated
queries. Precompiled SQL is stored in the database server and repeatedly
executed with new inputs, while dynamic SQL is recompiled every time it is run.
Using this technique in our tests reduced the access time for repeated queries
by a factor of 2 to 8.
· Multiplex database connections across several clients and maintain them
rather than creating and destroying a connection for each client. This has the
dual advantages of reduced connection time and improved scalability of the
application.
· Many real-world applications require several layers of security, including
authentication, authorization, data encryption and non-repudiation. Since
security adds significant overhead to a distributed request and encryption
algorithms are computationally intensive, use the minimum security level
possible for a given operation and user.
· A common technique for traversing firewalls is to tunnel the object request
through HTTP. While this approach is the most flexible, it comes with a
significant performance penalty. A better approach, when possible, is to open a
minimum set of firewall ports or to use an object-protocol friendly firewall
proxy (such as Wonderwall from Iona).
· Finally, if delays can’t be avoided due to a large number of system users or
slow-running server objects, their impact on the client can be minimized by
creating a separate thread to handle the object request and returning control
of the GUI back to the user.
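As an illustration of the precompiled SQL technique listed above, one common way to express it in JDBC is a `PreparedStatement` that is prepared once and re-executed with new parameter values. The sketch below is not the code used in our tests: the table and column names are hypothetical, and whether the statement is actually compiled and cached on the server depends on the driver and database in use.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

/** Sketch of re-executing one prepared statement with new inputs. */
class PrecompiledQuery {
    // Hypothetical table and column names, for illustration only.
    static final String SQL = "SELECT name FROM customers WHERE region_id = ?";

    /**
     * Prepares the statement once, then executes it once per region id
     * instead of building and submitting new SQL text each time.
     */
    static List<String> namesForRegions(Connection conn, int[] regionIds)
            throws SQLException {
        List<String> names = new ArrayList<>();
        try (PreparedStatement stmt = conn.prepareStatement(SQL)) {
            for (int regionId : regionIds) {
                stmt.setInt(1, regionId); // bind the new input value
                try (ResultSet rows = stmt.executeQuery()) {
                    while (rows.next()) {
                        names.add(rows.getString(1));
                    }
                }
            }
        }
        return names;
    }
}
```

The caller supplies an open `Connection`, which also fits the connection multiplexing advice above: the same pooled connection can serve many such calls.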
Future Improvements
The great interest that Java is receiving from corporate IT departments has
caused system vendors to continue the rapid pace of advancement in this
technology. High on their priority list are further improvements in the speed
of Java applications, both standalone and distributed.
For standalone applications, Sun is delivering JDK 1.2 later this year, which
promises performance improvements in strings, vectors, dates and the JIT
compiler. Q1 ’99 heralds the availability of the revolutionary HotSpot compiler
from Sun, which in preliminary tests at JavaOne ran applications faster than
C++! HotSpot is a cross between a JIT compiler and an interpreter and provides
its dramatic performance improvements with a much-improved generational garbage
collector, fast thread synchronization and “adaptive” compilation.
While improved object middleware such as CORBA 2.2 and RMI 2.0 promises to
increase the speed of distributed Java applications, the most significant
development in this area is the rapid advancement of Java Application Servers.
These application servers host the server-side business objects and
automatically provide them with multithreading, database and resource pooling
and load-balancing, all of which promise to make real-world Java applications
the fastest and most scalable kind of distributed applications available.