Table of Contents
ProMC User's Manual
(written by S.Chekanov, ANL)
ProMC is a package for file input and output for structural event records (such as Monte Carlo or data events). The main features include:
- Compact file format based on a content-dependent “compression” using Google's Protocol Buffers. Although we use the word “compression”, it is just a very compact binary format (see the discussion below).
- ProMC is not based on a compression algorithm (like gzip). It simply streams data into a binary wire format. Therefore, no CPU overhead due to compression/decompression is expected.
- Self-describing file format. One can generate C++/Java/Python code to read or write files from an existing ProMC data file and make future modifications without the language-specific code used to generate the original file.
- Multiplatform. Data records can be read and written in C++, Java and Python. PHP can be used to access event records.
- Forward and backward compatible binary format.
- Random access. Events can be read starting at any index.
- Optimized for parallel computation.
- Metadata can be encoded for each record, which allows fast access to interesting events.
- A logfile can be embedded into ProMC files. Useful for Monte Carlo generators.
- No external dependencies. The library is small and does not depend on ROOT or any other libraries.
- Events can be read from remote files (using random access).
- Well suited for archiving data. In addition to being very compact, any future modification can be made by generating analysis code using the “self-describing” property.
ProMC (“ProtocolBuffers MC”) is based on Google's Protocol Buffers, a language-neutral, platform-neutral and extensible mechanism for serializing structured data. It uses “varints” as a way to store and compress integers using one or more bytes: smaller numbers take a smaller number of bytes. This means that low-energy particles can be represented by a smaller number of bytes, since the values used to store their 4-momenta are smaller compared to those of high-energy particles. This is an important concept for storing events with many soft particles (“pileup”) in the same event record, since they use less disk space.
This project is tailored to the HEP ANL BlueGene/Q project, since it can provide a simple and efficient way to stream data from/to BlueGene/P.
The main idea behind ProMC is to use “content-dependent” compression to store particles depending on their importance. A 14 TeV pp collision event with 140 pileup soft events can have more than 10k particles. Most of them have low pT (“soft”). If we encode 4-momenta using integer values, then soft particles can be represented by smaller values compared to the most interesting (“hard”) particles. If we encode this information using Protocol Buffers “varints”, we use fewer bytes to store the soft particles from pileup. Read Protocol-buffers Encoding.
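As a rough illustration of why this helps: a varint stores 7 payload bits per byte, so the number of bytes grows with the magnitude of the value. The sketch below is plain Python, not part of the ProMC API, and counts bytes for a raw unsigned varint; the exact counts in real files may differ depending on the signed integer encoding used.

```python
def varint_bytes(value):
    # Number of bytes needed to store a non-negative integer as a varint:
    # each byte carries 7 bits of payload, so the size grows with magnitude.
    n = 1
    while value >= 1 << (7 * n):
        n += 1
    return n

# A soft particle whose momentum maps to a small integer costs far fewer
# bytes than a hard multi-TeV particle mapped to a large integer.
print(varint_bytes(100))         # small value: 1 byte
print(varint_bytes(2000000000))  # large value: 5 bytes
```

This is what makes the encoding “content-dependent”: the cost of a particle in bytes follows the size of the numbers describing it.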
However, Protocol Buffers alone is not sufficient, since it can only be used to write and read separate “messages” (i.e. single events). ProMC is designed to store multiple “messages” (or “events” in HEP terminology) in a file in a platform-neutral way. It also constructs a header for the events and organizes the “messages” in a form suitable for Monte Carlo event records.
Example: A typical HEPMC file size for 100 ttbar events with 140 pileup events (14 TeV) is 1,230 MB. Gzipping this file reduces the file size to 445 MB (but then the file can no longer be read directly). The main objective is to store such events in a platform-independent file of about 300 MB and still be able to read the data (even using random access). As you will see below, this goal has been achieved with the ProMC package.
About the data compression
When we say “compression”, we typically mean some compression algorithm. In ROOT, up to 50% of CPU is spent on compression/decompression of data. ProMC does not use any algorithm to compress or decompress files. It just streams data into a binary format without CPU overhead.
- HepMC is a popular, platform-independent way to store event records. The most common approach is to save records in ASCII files. A typical size of data for 100 pp collision events (14 TeV) with 140 pileup events is 1.2 GB. A gzip compression can reduce it to 450 MB (mainly for storage), but there is no way to read and write the compressed files directly. Despite the large size of HEPMC files, the main advantage is that the format is multiplatform and human readable.
- The StdHep C++ library is no longer supported. It uses gzip compression, but not at the byte level as in the case of varints. The format is not multiplatform.
- LHEF is another event record format. It is XML based and inherits the problems which were solved by Google: Protocol Buffers messages are 3 to 10 times smaller than XML (assuming no zip compression) and 20 to 100 times faster to serialize.
- Another way to store events is to use ROOT. The format is not fully multiplatform (although attempts have been made to read it using Java). ROOT uses the standard “gzip” compression and fixed-precision values. This means that soft and hard particles take the same storage, since both are represented by a fixed number of bytes.
Each event of the ProMC library is a Protocol Buffers message. Float values (pX, pY, pZ, M) are encoded as “int64” integers. In this representation, 0.01 MeV is the minimum allowed energy, while 24 TeV is the maximum allowed energy.
Here is a mapping table:
|Energy||Representation||How many bytes in encoding|
|0.01 MeV||1||1 byte|
|0.1 MeV||10||1 byte|
|1 MeV||100||2 bytes|
|1 GeV||100 000||4 bytes|
|1 TeV||100 000 000||8 bytes|
|20 TeV||2 000 000 000||8 bytes|
Thus, the 4-momentum of a soft particle (~MeV) can be represented by a reduced number of bytes compared to a fixed-length encoding. For a typical pT spectrum (a falling distribution), this means that the bulk of the particle spectrum at low pT is compressed more effectively than the particles in the high-pT tail. The compression therefore depends on the pT spectrum.
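The float-to-integer mapping in the table can be sketched as follows, assuming a unit of 0.01 MeV (so 1 GeV maps to 100 000, as in the table). In real files the conversion unit is stored in the header message; the constant below is only for illustration.

```python
UNITS_PER_GEV = 100000  # illustrative: 1 unit = 0.01 MeV, matching the table

def to_units(energy_gev):
    # Convert an energy in GeV to the integer representation stored in the file.
    return int(round(energy_gev * UNITS_PER_GEV))

def from_units(value):
    # Inverse conversion back to GeV on reading.
    return value / float(UNITS_PER_GEV)

print(to_units(0.001))    # 1 MeV  -> 100
print(to_units(1.0))      # 1 GeV  -> 100000
print(to_units(20000.0))  # 20 TeV -> 2000000000
```

Since the integers are then written as varints, small energies automatically occupy fewer bytes on disk.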
There are other places where this varint encoding is efficient for MC event records. For example, partons typically have small integer values of:
- PDG_ID - PDG ID
- status - status code
- daughter1 - 1st daughter
- daughter2 - 2nd daughter
- mother1 - 1st mother
- mother2 - 2nd mother
thus they are compressed more effectively using “varints” than final-state or exotic particles with large PDG_ID numbers. Also, light particles (partons) will be compressed more effectively due to their small masses.
Another place where ProMC tries to optimize storage is by setting the masses of the most common particles in a map in the header message of the record. For example, the masses of pions and kaons can be set to 0 (a single byte). During reading, the masses are restored using the map stored in the header.
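The idea can be sketched like this. The map contents and the helper function below are hypothetical, for illustration only; in a real file the particle-ID-to-mass map lives in the header message.

```python
# Masses of common particles (in GeV) as they might appear in the header
# map; the values and the helper below are illustrative, not the real API.
pdg_masses = {211: 0.13957, 321: 0.49368, 2212: 0.93827}

def restore_mass(pdg_id, stored_mass):
    # A zeroed mass is restored from the header map on reading;
    # anything else is taken as stored in the event record.
    if stored_mass == 0 and abs(pdg_id) in pdg_masses:
        return pdg_masses[abs(pdg_id)]
    return stored_mass

print(restore_mass(-211, 0))    # pi-: mass restored from the map
print(restore_mass(25, 125.0))  # uncommon particle: mass stored explicitly
```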
ProMC record layouts
A typical ProMC file has 4 major Protocol Buffers messages:
- File description message (“file metadata”) with timestamp, version, Nr of events, description.
- A header “message” to keep MC event metadata. It keeps global information about the initial colliding particles, PDF, cross section, etc. It also keeps track of which units are used to convert floats to integers. In addition, it keeps information on PDG particles (particle ID, masses, names, charges, etc.). The header is encoded as a separate Protocol Buffers message and is supposed to provide the necessary description of the events.
- Events as separate Protocol Buffers messages streamed in a multiplatform binary form using bytes of variable length. Each message is composed of “Event information” and “Particle information”. See the description below.
- At the very end, there is a “Statistics” Protocol Buffers message which can have information on all events, accumulated cross sections, etc. This record is optional in ProMC.
ProMC is based on random access, i.e. you can read the “Header”, “Statistics”, and any event record from any place in your program.
The data layouts inside ProMC files are implemented using Google's Protocol Buffers template files. Look at the protocol-buffers language guide for such files. These files can be used to generate analysis code in any programming language (C++, Java, Python, ...). There are a few files used to create and read ProMC files:
- The description message is Description.proto
- The header of the file record is ProMCHeader.proto. Typically the header is created before the main loop over events.
- The MC event record (repeatable) ProMC.proto
- The statistics of the generated events is stored in ProMCStat.proto. This record is inserted last, after the MC statistics have been accumulated and the final cross section calculated.
These are the files shipped with the default installation, suitable for keeping truth MC information. More complicated data layouts are given in the examples/proto directory (to keep jets, leptons, jets with constituents, etc.).
The proto files (Description.proto, ProMCHeader.proto, ProMC.proto, ProMCStat.proto) can be embedded into the ProMC file record, making the file “self-describing”. It is recommended to embed such files, since one can later generate analysis code in any programming language from them. This is ideal for preserving data and making future modifications without knowing the analysis code used to create the original data. See the tutorials for examples.
To embed the layout templates inside a ProMC file, simply make a directory “proto” and copy (or link) these files. If such files are embedded, you can retrieve the proto files and generate C++/Java/Python code which will read the data using the data structure used to create the file.
Optionally, you can also include a log file in ProMC files. If you have a file “logfile.txt” in the same directory where you write ProMC files, it will be included in the ProMC record (and compressed).
Available ProMC commands
|promc_info <file>||analyzes the file and shows its description|
|promc_browser <file>||Start a Java browser and look at the records|
|promc_dump <file>||dump all the information|
|promc_extract <file> <out> N||extracts N events and save them to out.promc|
|promc_proto <file>||extracts the self-description of a ProMC file and generates a “proto” directory describing the data inside the file|
|promc_code||generates source files using “proto” directory. C++ code is generated in the directory src/, while Java code in the directory java/src|
|promc_log <file>||extracts the log file “logfile.txt” (if attached to the ProMC file)|
|hepmc2promc <HEPMC input> <ProMC output> “description”||converts HepMC file to ProMC file|
|promc2hepmc <ProMC input> <HepMC output>||converts ProMC file to HEPMC file|
Here is a small Python script which reads a ProMC file and extracts its self-description (including embedded proto files and the logfile):
import zipfile
z = zipfile.ZipFile("out/output.promc", "r")
print z.read("promc_nevents")      # Nr of events in the file
print z.read("promc_description")  # description
print z.read("ProMCHeader.proto")  # embedded ProtoBuf templates used to describe messages
print z.read("ProMC.proto")
print z.read("logfile.txt")        # embedded logfile
for filename in z.namelist():      # loop over all entries
    print filename
    #bytes = z.read(filename)
    #print len(bytes)
This is a Python script. You can also read this info in Java and PHP, as long as you can read an entry inside a zip file.
One can also unzip a ProMC file as:
mv file.promc file.zip
unzip file.zip
All messages will be dumped into separate files.
You can extract a given record/event using the random access capabilities of this format. This is trivial to do in Java. For a C++ example, check the code in “examples/random_access”. Type make to compile it and run the code. You can see that the needed event is extracted using the method “event(index)”.
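Since a ProMC file is internally a zip archive whose event messages are stored as numbered entries, the same random access can be illustrated in Python with the standard zipfile module. The archive below is a stand-in built in memory for the sake of a self-contained example, not a real ProMC file.

```python
import io
import zipfile

# Build a tiny stand-in archive: entries named "0", "1", "2" play the
# role of event messages stored under their event index.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    for i in range(3):
        z.writestr(str(i), ("event-%d-payload" % i).encode())

# Random access: jump straight to entry "1" without scanning earlier events.
with zipfile.ZipFile(buf, "r") as z:
    data = z.read("1")
print(data)
```

Reading a real file works the same way, except that the entry contents are binary Protocol Buffers messages rather than plain strings.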
Reading data remotely
You can stream data from a remote server without downloading ProMC files. The easiest way is to use the Python reader (see the example in examples/python). Below we show how to read one single event (event=100) remotely using Python:
# Shows how to read a single event from a remote file. S.Chekanov
import urllib2, cStringIO, zipfile
url = "http://atlaswww1.hep.anl.gov/asc/snowmass2013/delphes36/TruthRecords/higgs14tev/pythia8/pythia8_higgs_1.promc"
try:
    remotezip = urllib2.urlopen(url)
    zipinmemory = cStringIO.StringIO(remotezip.read())
    zip = zipfile.ZipFile(zipinmemory)
    for fn in zip.namelist():
        # print fn
        if fn=="100":
            data = zip.read(fn)
            print "Read event=100"
except urllib2.HTTPError:
    print "no file"
In this example, “data” represents a ProMC event record. Look at the example in examples/python to see how to print such info.
ProMC File Browser
You can look at events and other information stored in the ProMC files using a browser implemented in Java. It runs on Linux/Windows/Mac without any external libraries. First, get the browser:
And run it as follows (this assumes Java 7 or above; check with “java -version”, which should show version 1.7.X):
java -jar browser_promc.jar
Now we can open a ProMC file. Let's get an example ProMC file which keeps 1,000 events generated by Pythia8:
Open this file in the browser as: [File]→[Open file]. Or you can open it using the prompt:
java -jar browser_promc.jar Pythia8.promc
This opens the file and shows the metadata (i.e. information stored in the header and statistics records):
On the left, you will see event numbers. Double click on any number. The browser will display the event record with all stored particles for this event (PID, Status,Px,Py,Pz, etc).
You can access metadata on particle data, such as information on particle types, PID and masses using the [Metadata]→[Particle data] menu. This record is common for all events (ProMC does not store particle names and masses for each event).
If the ProMC file was made “self-describing” and stores templates for proto layouts used to generate analysis code, you can open the “Data layout” menu:
This information can be used to generate analysis code and make future modifications to the existing file. Use the “promc_proto” command to extract such files, and “promc_code” to generate the analysis code. See the tutorial section.
You can look at event information (process ID, PDF, alphaS, weight) if you navigate with the mouse to the event number on the left, and click on the right button. You will see a pop-up menu. Select “Event information”.
You can plot histograms for the desired variables using several approaches:
Where is ProMC appropriate?
ProMC is used to store truth MC event records (about a factor of 2 more compact than compressed HEPMC files). ProMC is also used for Snowmass 2012-2013 to keep Delphes fast-simulation files (including reconstructed jets and other objects). See the Snowmass web page. However, it is still at a prototype stage.
How to cite this work
The ProMC paper is coming. It will also be discussed in the Snowmass white paper. For now, cite this work as:
S.Chekanov, "Next generation input-output file format for HEP data based on Google's protocol buffers" (2013) https://atlaswww.hep.anl.gov/asc/promc
ProMC is a rewrite of an older package (CBook) for the community-supported jHepWork (this program is currently named SCaVis). The current ProMC version is based on the HEvent record format examples and the zipious++ library, which has been publicly available since 2008. (S.C.)
ProMC is licensed under the GNU General Public License v3 or later. Please refer to GPL-v3.0.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
The text of this manual can be freely redistributed, subject to the Creative Commons Attribution-Share Alike License; either version 3.0 of the License, or any later version. See BY-SA. You are free to copy, distribute, transmit, and adapt this manual under the following conditions:
- Attribution. You must attribute the ProMC work i.e. by linking this web page.
- Non-commercial. You may not use this work for commercial purposes
- Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license.
Future plans
- More benchmarks. Check compression for different pT scenarios (memory usage, speed).
- Write interfaces to PYTHIA, HERWIG (PYTHIA is almost done)
- Better buffering?
- Write interface for ROOT
- Write interface to Delphes fast simulation
- Finish JAVA and Python interfaces (Java is almost done)
- Programs to extract a given records and merge records (easy part)