(written by S.Chekanov, ANL)
ProMC is a package for file input and output of structured event records (such as Monte Carlo or data events). The main features include:
ProMC (“ProtocolBuffers” MC) is based on Google's Protocol Buffers, a language-neutral, platform-neutral and extensible mechanism for serializing structured data. It uses “varints” to store and compress integers using one or more bytes: smaller numbers take fewer bytes. This means that low-energy particles can be represented by fewer bytes, since the values needed to store their 4-momenta are smaller than those of high-energy particles. This is an important concept for storing events with many soft particles (“pileup”) in the same event record, since they use less disk space.
This project is tailored to the HEP ANL BlueGene/Q project, since it can provide a simple and efficient way to stream data from/to BlueGene/P.
The main idea behind ProMC is to use “content-dependent” compression to store particles depending on their importance. A 14 TeV pp collision event with 140 pileup soft events can have more than 10k particles. Most of them have low pT (“soft”). If we encode 4-momenta using integer values, then soft particles can be represented by smaller values than the most interesting (“hard”) particles. If we encode this information using Protocol Buffers “varints”, we use fewer bytes to store soft particles from pileup. See the Protocol Buffers Encoding documentation.
However, Protocol Buffers alone is not sufficient, since it is designed to write and read separate “messages” (i.e. “single events”). ProMC is designed to store multiple “messages” (or “events” in HEP terminology) in a file in a platform-neutral way. It also constructs a header for events and organizes “messages” in a form suitable for Monte Carlo event records.
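The idea of keeping many messages in one file can be sketched with Python's standard zipfile module, since a ProMC file can be read as a zip archive with one entry per event. This is only an illustration: the byte strings stand in for serialized Protocol Buffers messages, and the helper `write_container` and the entry name `header` are hypothetical, not ProMC's actual writer.

```python
import io
import zipfile

def write_container(stream, header_bytes, event_payloads):
    """Store a header and a sequence of serialized events as zip entries."""
    # ZIP_STORED writes the entries without running a compression algorithm.
    with zipfile.ZipFile(stream, "w", compression=zipfile.ZIP_STORED) as archive:
        archive.writestr("header", header_bytes)        # hypothetical entry name
        for index, payload in enumerate(event_payloads):
            archive.writestr(str(index), payload)       # one entry per event
        archive.writestr("promc_nevents", str(len(event_payloads)))

# Placeholder payloads standing in for serialized Protocol Buffers messages.
events = [b"event-%d" % i for i in range(5)]
buffer = io.BytesIO()
write_container(buffer, b"header-bytes", events)

# Read the archive back: the event count and any single event are addressable by name.
with zipfile.ZipFile(buffer) as archive:
    n_events = int(archive.read("promc_nevents"))
    third = archive.read("2")
    print(n_events, third)
```

Naming entries by event index is what makes it possible to fetch one event without scanning the others.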
Example: A typical HEPMC file size for 100 ttbar events with 140 pileup events (14 TeV) is 1,230 MB. Gzipping this file reduces the file size to 445 MB (in which case the file cannot be read without unpacking it first). The main objective is to store such events in a platform-independent file of about 300 MB and still be able to read the data (even using random access). As you will see below, this goal has been achieved with the ProMC package.
When we say “compression”, we typically mean some compression algorithm. In ROOT, up to 50% of CPU is spent on compression/decompression of data. ProMC does not use any algorithm to compress or decompress files. It just streams data into a binary format without CPU overhead.
Each event in ProMC is a “ProtocolBuffer” message. Float values (pX, pY, pZ, M) are encoded as “int64” integers. In this representation, 0.01 MeV is the minimum allowed energy, while 24 TeV is the maximum allowed energy.
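The float-to-integer conversion can be sketched as follows. This is a minimal illustration assuming that one integer step corresponds to 0.01 MeV; the names `UNIT_MEV`, `to_int` and `to_float` are hypothetical and not part of the ProMC API.

```python
UNIT_MEV = 0.01  # assumed resolution: one integer step is 0.01 MeV

def to_int(value_mev):
    """Convert a momentum component or mass in MeV to integer units."""
    return int(round(value_mev / UNIT_MEV))

def to_float(stored):
    """Convert a stored integer back to MeV."""
    return stored * UNIT_MEV

px_mev = 1234.567            # example value of about 1.2 GeV
stored = to_int(px_mev)      # integer that would be written to the record
restored = to_float(stored)  # value recovered when reading
print(stored, restored)
```

The round trip loses at most half a unit (0.005 MeV here), which is the price of storing integers instead of floats.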
Here is a mapping table:
|Energy||Representation (units of 0.01 MeV)||Bytes in encoding|
|0.01 MeV||1||1 byte|
|0.1 MeV||10||1 byte|
|1 MeV||100||2 bytes|
|1 GeV||100 000||4 bytes|
|1 TeV||100 000 000||8 bytes|
|20 TeV||2 000 000 000||8 bytes|
Thus, the 4-momentum of a soft particle (~ MeV) can be represented by fewer bytes than with a fixed-length encoding. For a typical pT spectrum (a falling distribution), this means that the bulk of the particles at low pT is compressed more effectively than the particles in the high-pT tail. The overall compression therefore depends on the shape of the pT spectrum.
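The byte counts can be checked with a short calculation. The sketch below assumes signed values are written with the Protocol Buffers "zigzag" varint encoding (as used for sint64 fields); whether ProMC uses exactly this field type is an assumption, and the function names are illustrative. Under this assumption the GeV-scale and TeV-scale values need 3 and 4 bytes, so the larger entries in the table above read as conservative upper bounds.

```python
def zigzag(n):
    """Map a signed 64-bit integer to an unsigned one, keeping small |n| small."""
    return (n << 1) ^ (n >> 63)

def varint_size(u):
    """Number of bytes a base-128 varint needs for the unsigned integer u."""
    size = 1
    while u >= 0x80:   # each byte carries 7 payload bits
        u >>= 7
        size += 1
    return size

samples = [("0.01 MeV", 1), ("1 MeV", 100), ("1 GeV", 100_000), ("1 TeV", 100_000_000)]
for label, value in samples:
    nbytes = varint_size(zigzag(value))
    print(label, value, nbytes, "byte(s), versus 8 for a fixed-width int64")
```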
There are other places where Google's encoding is efficient for MC event records. For example, partons typically have small integer identifiers (such as PDG_ID), thus they are compressed more effectively using “varints” than final-state or exotic particles with large PDG_ID numbers. Also, light particles (partons) will be compressed more effectively due to their small mass.
Another place where ProMC tries to optimize storage is to keep the masses of the most common particles as a map in the header message of the record. For example, the masses of pions and kaons can be stored as 0 in each event, which takes the minimum space a varint can occupy (a single byte). During reading, the masses are restored using the map stored in the header message.
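A minimal sketch of this mass-map idea is shown below. The map contents, the storage unit and the function names are illustrative assumptions, not the actual ProMC header layout.

```python
# Hypothetical header map from PDG ID to mass in GeV (rounded illustrative values).
MASS_MAP_GEV = {211: 0.13957, -211: 0.13957, 321: 0.49368, -321: 0.49368}
UNIT_GEV = 1e-5  # assumed storage unit: 0.01 MeV expressed in GeV

def stored_mass(pdg_id, mass_gev):
    """Return the integer written to the event: 0 if the header map covers the particle."""
    if pdg_id in MASS_MAP_GEV:
        return 0
    return int(round(mass_gev / UNIT_GEV))

def restored_mass(pdg_id, stored):
    """Recover the mass in GeV when reading the event back."""
    if stored == 0 and pdg_id in MASS_MAP_GEV:
        return MASS_MAP_GEV[pdg_id]
    return stored * UNIT_GEV

print(stored_mass(211, 0.13957), restored_mass(211, 0))   # pion: stored as 0
print(stored_mass(2212, 0.93827))                         # proton: stored explicitly
```

Because pions and kaons dominate the particle multiplicity, replacing their masses by 0 saves space in almost every event.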
A typical ProMC file has 4 major ProtoBuff messages:
ProMC is based on random access, i.e. you can read the “Header”, the “Statistics”, and any event record from any place in your program.
The data layouts inside ProMC files are implemented using Google's Protocol Buffers template files. See the Protocol Buffers language guide for the syntax used in such files. Such files can be used to generate analysis code in any programming language (C++, Java, Python, etc.). There are a few files used to create and read ProMC files:
These are the files that are shipped with the default installation and are suitable for keeping truth MC information. More complicated data layouts are given in the examples/proto directory (to keep jets, leptons, jets with constituents, etc.).
The proto files (such as ProMCHeader.proto and ProMC.proto) can be embedded in the ProMC file record, making the file “self-describing”. It is recommended to embed such files, since one can later generate analysis code in any programming language using them. This is ideal for preserving data and for making future modifications without knowing the analysis code used to create the original data. See the tutorials for examples.
To embed the layout templates inside a ProMC file, simply make a directory “proto” and copy (or link) these files into it. If such files are embedded, you can retrieve the proto files and generate C++/Java/Python code that reads the data using the same data structure that was used to create the file.
Optionally, you can also include a log file in a ProMC file. If a file “logfile.txt” exists in the directory where you write ProMC files, it will be included in the ProMC record (and compressed).
|promc_info <file>||analyzes the file and shows its description|
|promc_browser <file>||starts a Java browser to look at the records|
|promc_dump <file>||dumps all the information|
|promc_extract <file> <out> N||extracts N events and saves them to out.promc|
|promc_proto <file>||extracts the self-description of a ProMC file and generates a “proto” directory describing the data inside the file|
|promc_code||generates source files using the “proto” directory. C++ code is generated in the directory src/, while Java code is generated in the directory java/src|
|promc_log <file>||extracts the log file “logfile.txt” (if attached to the ProMC file)|
|hepmc2promc <HEPMC input> <ProMC output> “description”||converts a HepMC file to a ProMC file|
|promc2hepmc <ProMC input> <HepMC output>||converts a ProMC file to a HepMC file|
Here is a small Python script which reads a ProMC file and extracts its self-description (including the embedded proto files and the logfile):
import zipfile

# A ProMC file is a zip archive, so its entries can be read by name (Python 3).
z = zipfile.ZipFile("out/output.promc", "r")
print(z.read("promc_nevents").decode())      # Nr of events in the file
print(z.read("promc_description").decode())  # description
print(z.read("ProMCHeader.proto").decode())  # embedded ProtoBuff templates used to describe messages
print(z.read("ProMC.proto").decode())
print(z.read("logfile.txt").decode())        # embedded logfile
for filename in z.namelist():                # loop over all entries
    print(filename)
    # data = z.read(filename)
    # print(len(data))
This is a Python script. You can also read this information in Java or PHP, or in any language that can read an entry inside a zip file.
One can also unzip a ProMC file as:
mv file.promc file.zip
unzip file.zip
All messages will be dumped into separate files.
You can extract a given record/event using the random access capabilities of this format. This is trivial to do in Java. For a C++ example, check the code in “examples/random_access”. Type make to compile it, then run the code. You will see that the needed event is extracted using the method “event(index)”.
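The same index-based access can be sketched in Python with the standard zipfile module. This assumes that each event is stored as a zip entry named by its index; the helper `read_event` is hypothetical, and it returns raw bytes that still have to be decoded with the classes generated from the proto files.

```python
import io
import zipfile

def read_event(source, index):
    """Return the raw bytes of one event entry without reading the others."""
    with zipfile.ZipFile(source) as archive:
        # The zip central directory locates the entry, so only that entry is read.
        return archive.read(str(index))

# Self-contained demonstration using a small in-memory archive of placeholder events.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    for i in range(200):
        archive.writestr(str(i), b"payload-%d" % i)

data = read_event(demo, 100)
print(len(data), data)
```

For a file on disk, the first argument would simply be its path, for example `read_event("Pythia8.promc", 100)`.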
You can stream data from a remote server without saving ProMC files to disk. The easiest way is to use the Python reader (see the example in examples/python). Below we show how to read one single event (event=100) remotely using Python:
# Shows how to read a single event from a remote file (Python 3). S.Chekanov
import io
import urllib.error
import urllib.request
import zipfile

url = "http://atlaswww1.hep.anl.gov/asc/snowmass2013/delphes36/TruthRecords/higgs14tev/pythia8/pythia8_higgs_1.promc"
try:
    remotezip = urllib.request.urlopen(url)
    zipinmemory = io.BytesIO(remotezip.read())  # the archive is held in memory, not on disk
    archive = zipfile.ZipFile(zipinmemory)
    for fn in archive.namelist():
        # print(fn)
        if fn == "100":
            data = archive.read(fn)
            print("Read event=100")
except urllib.error.HTTPError:
    print("no file")
In this example, “data” represents a ProMC event record. See the example in examples/python for how to print such information.
You can look at events and other information stored in the ProMC files using a browser implemented in Java. It runs on Linux/Windows/Mac without any external libraries. First, get the browser:
Then run it as follows (this assumes Java 7 or above; check with “java -version”, which should show version 1.7.X or later):
java -jar browser_promc.jar
Now we can open a ProMC file. Let's get an example ProMC file which keeps 1,000 events generated by Pythia8:
Open this file in the browser as: [File]→[Open file]. Or you can open it using the prompt:
java -jar browser_promc.jar Pythia8.promc
This opens the file and shows the metadata (i.e. information stored in the header and statistics records):
On the left, you will see event numbers. Double click on any number. The browser will display the event record with all stored particles for this event (PID, Status,Px,Py,Pz, etc).
You can access metadata on particle data, such as information on particle types, PID and masses using the [Metadata]→[Particle data] menu. This record is common for all events (ProMC does not store particle names and masses for each event).
If the ProMC file was made “self-describing” and stores templates for proto layouts used to generate analysis code, you can open the “Data layout” menu:
This information can be used to generate analysis code and to make future modifications to the existing file. Use the “promc_proto” command to extract such files, and “promc_code” to generate the analysis code. See the tutorial section.
You can look at event information (process ID, PDF, alphaS, weight) by moving the mouse to the event number on the left and clicking the right mouse button. You will see a pop-up menu. Select “Event information”.
You can plot histograms for the desired variables using several approaches:
ProMC is used to store truth MC event records (about 2 times more compact than compressed HEPMC files). ProMC is also used for Snowmass 2012-2013 to keep Delphes fast simulation files (including reconstructed jets and other objects). See the Snowmass web page. However, it is still at a prototype stage.
The ProMC paper is coming. It will also be discussed in the Snowmass white paper. For now, cite this work as:
S.Chekanov, "Next generation input-output file format for HEP data based on Google's protocol buffers" (2013) https://atlaswww.hep.anl.gov/asc/promc
ProMC is a rewrite of an older package (CBook) for the community-supported jHepWork; this program currently has the name SCaVis. The current ProMC version is based on the HEvent record format Examples and the zipios++ library, which has been publicly available since 2008. (S.C.).
ProMC is licensed under the GNU General Public License v3 or later. Please refer to GPL-v3.0.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
The text of this manual cannot be freely redistributed and is subject to the Creative Commons Attribution-Share Alike License; either version 3.0 of the License, or any later version. See By-SA. You are free to copy, distribute, transmit, and adapt ProMC under the following conditions: