[[man:promc:start.txt|<< back]]


==== ProMC User's Manual ====
(written by S.Chekanov, ANL)

**ProMC** is a package for file input and output of structured event records (such as Monte Carlo or data events). The main features include:

  * Compact file format based on a content-dependent "compression" using Google's Protocol Buffers. See the discussion [[asc:promc#how_particles_are_stored_in_promc|here]]. Although we use the word "compression", it is just a very compact binary format (see the discussion below).
  * ProMC is not based on a compression algorithm (like gzip). It is just a streaming of data into a binary wire format. Therefore, no CPU overhead due to compression/decompression is expected.
  * Self-describing file format. One can generate C++/Java/Python code to read or write files from an existing ProMC data file and make future modifications without the language-specific code used to generate the original file.
  * Multiplatform. Data records can be read and written in C++, Java and Python. PHP can be used to access event records.
  * Forward and backward compatible binary format.
  * Random access. Events can be read starting at any index.
  * Optimized for parallel computation.
  * Metadata can be encoded for each record, which allows fast access to interesting events.
  * A logfile can be embedded into ProMC files. Useful for Monte Carlo generators.
  * No external dependencies. The library is small and does not depend on ROOT or any other libraries.
  * Events can be read from remote files (using random access).
  * Well suited for archiving data. In addition to being very compact, any future modification can be made by generating analysis code using the "self-describing" property.

**ProMC** ("ProtocolBuffers" MC) is based on Google's [[https://developers.google.com/protocol-buffers| Protocol Buffers]],
a language-neutral, platform-neutral and extensible mechanism for serializing structured data.
It uses "varints" as a way to store and compress integers using one or more bytes.
Smaller numbers take a smaller number of bytes. This means that low-energy particles
can be represented by fewer bytes, since the values needed to store their 4-momenta are smaller than those of high-energy particles.
This is an important concept for storing events with many soft particles ("pileup") in the same event record, since they use less disk space.
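
To make the idea concrete, here is a minimal Python sketch (not part of ProMC) of the varint scheme: each byte carries 7 bits of the value and the high bit marks whether more bytes follow, so smaller integers need fewer bytes. The byte counts below do not include the per-field tag overhead of a full Protocol Buffers message.

<code python>
# Minimal illustration of Protocol Buffers "varint" encoding (not the ProMC API):
# each byte carries 7 bits of the value; the high bit flags that more bytes follow.
def varint_bytes(value):
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)   # continuation bit set, more bytes follow
        else:
            out.append(byte)          # last byte
            return bytes(out)

# Small integers (soft particles) need fewer bytes than large ones (hard particles):
for v in (1, 100, 100000, 100000000):
    print "%12d -> %d byte(s)" % (v, len(varint_bytes(v)))
</code>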

This project is tailored to the HEP ANL BlueGene/Q project, since it can provide a simple and efficient way to stream data from/to BlueGene/Q.

===== Why ProMC? =====

The main idea behind ProMC is to use "content-dependent" compression to store particles depending on their importance. A 14 TeV pp collision event with 140 soft pileup events can have more than 10k particles. Most of them have low pT ("soft"). If we can encode 4-momenta using integer values,
then soft particles can be represented by smaller values compared to the most interesting ("hard") particles.
If we encode this information using Protocol Buffers "varints", we can use **fewer bytes to store soft particles from pileup**.
Read [[https://developers.google.com/protocol-buffers/docs/encoding| Protocol-buffers Encoding]].

However, Protocol Buffers alone is not sufficient, since it can only write and read separate "messages" (i.e. single events). ProMC is designed to store multiple "messages" (or "events" in HEP terminology) in a file in a platform-neutral way. It also constructs a header for events and organizes "messages" in a form suitable for Monte Carlo event records.

//Example:// A typical HepMC file size for 100 ttbar events with 140 pileup events (14 TeV) is **1,230 MB**. Gzipping this file reduces the size to **445 MB** (but then the file can no longer be read directly). The main objective is to store such events in a platform-independent file of about **300 MB** and still be able to read the data (even using random access). As you will see below, this goal has been achieved with the ProMC package.

===== About the data compression =====

When we say "compression", we typically mean some compression algorithm. In ROOT, up to 50% of the CPU time can be spent on compression/decompression of data. ProMC does not use any algorithm to compress or decompress files. It just streams data into a binary format without CPU overhead.

===== Other approaches =====

  * [[http://lcgapp.cern.ch/project/simu/HepMC/| HepMC]] is a popular, platform-independent way to store event records. The most common approach is to save records in ASCII files. A typical data size for 100 pp collision events (14 TeV) with 140 pileup events is 1.2 GB. A gzip compression can reduce it to 450 MB (mainly for storage), but there is no way to read and write the compressed files directly. Despite the large size of HepMC files, their main advantage is that they are multiplatform and human readable.
  * The [[http://cepa.fnal.gov/psm/stdhep/c++/| StdHep]] C++ library is no longer supported. It uses gzip compression, but not at the byte level as in the case of varints. The format is not multiplatform.
  * [[http://adsabs.harvard.edu/abs/2010arXiv1001.2576B|LHEF]] is another event record. It is XML based and inherits problems which were solved by Google: Protocol Buffers have many advantages over XML for serializing structured data, being 3 to 10 times smaller (assuming no zip compression) and 20 to 100 times faster.
  * Another way to store events is to use [[http://root.cern.ch/drupal/|ROOT]]. The format is not fully multiplatform (although attempts were made to read it using Java). ROOT uses the standard "gzip" compression and fixed-precision values. This means that soft and hard particles take the same storage, since they are represented by a fixed number of bytes.

===== Particle representation =====

Each event in a ProMC file is a "ProtocolBuffer" message. Float values (pX, pY, pZ, M) are encoded as "int64" integers. In this representation,
0.01 MeV is the minimum allowed energy, while 24 TeV is the maximum allowed energy.

Here is a mapping table:

^ Energy       ^  Representation     ^  How many bytes in the encoding  ^
| 0.01 MeV     |  1                  | 1 byte                           |
| 0.1 MeV      |  10                 | 1 byte                           |
| 1 MeV        |  100                | 2 bytes                          |
| 1 GeV        |  100 000            | 4 bytes                          |
| 1 TeV        |  100 000 000        | 8 bytes                          |
| 20 TeV       |  2 000 000 000      | 8 bytes                          |
Thus, the 4-momentum of a soft particle (~ MeV) can be represented by a reduced number of bytes compared to fixed-length encoding. For a typical pT spectrum (a falling distribution), this means that the bulk of the particle spectrum at low pT is compressed more effectively than the particles in the high-pT tail. The achieved compression depends on the pT spectrum.
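
As a concrete illustration of this table, the sketch below converts an energy in GeV into the integer that is actually written out, assuming the unit convention of the table (1 unit = 0.01 MeV, i.e. 100 000 units per GeV); smaller energies map to smaller integers, which the varint encoding then stores in fewer bytes:

<code python>
# Illustration only: convert an energy in GeV to its integer representation,
# assuming 1 unit = 0.01 MeV (100 000 units per GeV) as in the table above.
UNITS_PER_GEV = 100000

def to_storage_units(energy_gev):
    return int(round(energy_gev * UNITS_PER_GEV))

for e in (0.0001, 0.001, 1.0, 1000.0):   # 0.1 MeV, 1 MeV, 1 GeV, 1 TeV
    print "%10.4f GeV -> %d units" % (e, to_storage_units(e))
</code>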


<note>
ProMC keeps the units in the header file. One can always redefine the units as needed, emphasizing better precision for low-momentum particles compared to high-pT particles.</note>


There are other places where Google's varint encoding is efficient for MC event records. For example, partons typically have small integer values of:

  * PDG_ID     - PDG ID
  * status     - status code
  * daughter1  - 1st daughter
  * daughter2  - 2nd daughter
  * mother1    - 1st mother
  * mother2    - 2nd mother

Thus they are compressed more effectively using "varints" than final-state or exotic particles with large PDG_ID numbers. Also, light particles (partons) will be compressed more effectively due to their small masses.


Another place where ProMC tries to optimize storage is by setting the masses of the most common particles in a map in the header message of the record. For example, the masses of pions and kaons can be set to a 0 value. During reading, the masses are restored using the map stored in the header.
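
The idea can be sketched as follows (the names below are purely illustrative and do not correspond to the actual ProMC API): a single PDG-ID-to-mass map is kept in the header, the per-particle mass is written as 0, and the reader restores it from the map.

<code python>
# Hypothetical illustration of the mass-map trick (not the actual ProMC API).
# The header stores one PDG-ID -> mass map; each particle stores mass 0
# when its mass can be looked up from the map.
header_mass_map = {211: 0.13957, 321: 0.49368, 2212: 0.93827}  # GeV: pi+, K+, proton

def restore_mass(pdg_id, stored_mass):
    """Return the stored mass, or the map value if the stored mass is zero."""
    if stored_mass == 0 and pdg_id in header_mass_map:
        return header_mass_map[pdg_id]
    return stored_mass

print restore_mass(211, 0)      # pion mass recovered from the header map
print restore_mass(25, 125.0)   # a particle not in the map keeps its stored mass
</code>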


===== ProMC record layouts =====

A typical ProMC file has 4 major Protocol Buffers messages:

  - A file description message ("file metadata") with a timestamp, version, number of events and a description.
  - A header "message" to keep MC event metadata. It keeps global information about the initial colliding particles, PDF, cross section, etc. It also keeps track of which units are used to convert floats to integers. In addition, it keeps information on PDG particles (particle ID, masses, names, charges, etc.). The header is encoded as a separate Protocol Buffers message and is supposed to provide the necessary description of the events.
  - Events as separate Protocol Buffers messages streamed in a multiplatform binary form using bytes of variable length. Each message is composed of "Event information" and "Particle information". See the description below.
  - At the very end, there is a "Statistics" Protocol Buffers message which can have information on all events, accumulated cross sections, etc. This is an optional record in ProMC.

ProMC is based on random access, i.e. you can read the "Header", "Statistics", and any event record from any place in your program.

The data layouts inside ProMC files are implemented using Google's Protocol Buffers
template files. Look at the language guide
[[https://developers.google.com/protocol-buffers/docs/proto| protocol-buffers]] used for such files.
Such files can be used to generate analysis code in any programming language (C++, Java, Python).
There are a few files used to create and read ProMC files:

  * The description message is [[http://atlaswww.hep.anl.gov/asc/WebSVN/filedetails.php?repname=ProMC&path=%2FProMC%2Ftrunk%2Fproto%2Fpromc%2FProMCDescription.proto|Description.proto]]
  * The header of the file record is [[http://atlaswww.hep.anl.gov/asc/WebSVN/filedetails.php?repname=ProMC&path=%2FProMC%2Ftrunk%2Fproto%2Fpromc%2FProMCHeader.proto|ProMCHeader.proto]]. Typically, the header is created before the main loop over events.
  * The MC event record (repeatable) is [[http://atlaswww.hep.anl.gov/asc/WebSVN/filedetails.php?repname=ProMC&path=%2FProMC%2Ftrunk%2Fproto%2Fpromc%2FProMC.proto|ProMC.proto]]
  * The statistics of the generated events is stored in [[http://atlaswww.hep.anl.gov/asc/WebSVN/filedetails.php?repname=ProMC&path=%2FProMC%2Ftrunk%2Fproto%2Fpromc%2FProMCStat.proto|ProMCStat.proto]]. This record is inserted last, after the MC statistics are accumulated and the final cross section is calculated.

These are the files that are shipped with the default installation and are suitable for keeping truth MC information. More complicated data layouts are given in the examples/proto directory (to keep jets, leptons, jets with constituents, etc.).


The proto files (//ProMCDescription.proto, ProMCHeader.proto, ProMC.proto, ProMCStat.proto//) can be embedded in the ProMC
file record, making the file "self-describing". It is recommended to embed such files, since one can later generate analysis code in any programming language using these files. This is ideal for preserving data and making future modifications without knowing the analysis code used to create the original data.
See the tutorials for examples.

To embed the layout templates inside a ProMC file, simply make a directory "proto" and copy (or link) these files into it. If such files are embedded, you can retrieve the proto files and generate C++/Java/Python code which will read the data using the data structure used to create the file.

Optionally, you can also include a log file in the ProMC file. If you have a file "//logfile.txt//"
in the same directory where you write ProMC files, it will be included in the ProMC record (and compressed).


==== Available ProMC commands ====

^  ProMC command                                            ^  Description                                                        ^
|  promc_info <file>                                        | analyzes the file and shows its description                        |
|  promc_browser <file>                                     | starts a Java browser to look at the records                       |
|  promc_dump <file>                                        | dumps all the information                                           |
|  promc_extract <file> <out> N                             | extracts N events and saves them to out.promc                      |
|  promc_proto <file>                                       | extracts the self-description of a ProMC file and generates a "proto" directory describing the data inside the file |
|  promc_code                                               | generates source files using the "proto" directory. C++ code is generated in the directory src/, while Java code goes to java/src |
|  promc_log <file>                                         | extracts the log file "logfile.txt" (if attached to the ProMC file) |
|  hepmc2promc <HepMC input> <ProMC output> "description"   | converts a HepMC file to a ProMC file                               |
|  promc2hepmc <ProMC input> <HepMC output>                 | converts a ProMC file to a HepMC file                               |



Here is a small Python script which reads a ProMC file and extracts its self-description
(including the embedded proto files and logfile):
<code python>
import zipfile

z = zipfile.ZipFile("out/output.promc", "r")
print z.read("promc_nevents")     # number of events in the file
print z.read("promc_description") # description
print z.read("ProMCHeader.proto") # embedded ProtoBuf templates used to describe messages
print z.read("ProMC.proto")
print z.read("logfile.txt")       # embedded logfile

for filename in z.namelist():     # loop over all entries
    print filename
    #bytes = z.read(filename)
    #print len(bytes)
</code>

This is a Python script, but you can also read this information in Java and PHP, as long as you can read an entry inside a zip file.

One can also unzip a ProMC file directly:

<code bash>
mv file.promc file.zip
unzip file.zip
</code>

All messages will be dumped into separate files.
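
Equivalently, since a ProMC file is an ordinary zip archive, the same extraction can be done from Python without renaming the file (the output directory name below is just an example):

<code python>
import zipfile

# A ProMC file is a regular zip archive, so its messages can be extracted directly.
z = zipfile.ZipFile("file.promc", "r")
z.extractall("promc_contents")   # one file per stored message/entry
z.close()
</code>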

==== Random access ====

You can extract a given record/event using the random access capabilities of this format.
This is trivial to do in Java.
For a C++ example, check the code in "examples/random_access". Type make to compile it and run the code.
You can see that the needed event can be extracted using the method "event(index)".
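
The same random access is easy from Python, because each event is stored as a separate zip entry whose name is the event index (as the remote-reading example below also shows). A minimal sketch, with the file name and event index chosen only as examples:

<code python>
import zipfile

# Read one event record directly by its index, without touching the other events.
z = zipfile.ZipFile("out/output.promc", "r")
data = z.read("100")     # raw Protocol Buffers message of event number 100
print "event 100 takes %d bytes" % len(data)
z.close()
</code>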


==== Reading data remotely ====

You can stream data from a remote server without downloading ProMC files. The easiest way is to use the Python reader (see the example
in examples/python). Below we show how to read one single event (event=100) remotely using Python:


<code python file.py>
# Shows how to read a single event from a remote file. S.Chekanov
import urllib2, cStringIO, zipfile
url = "http://atlaswww1.hep.anl.gov/asc/snowmass2013/delphes36/TruthRecords/higgs14tev/pythia8/pythia8_higgs_1.promc"
try:
    remotezip = urllib2.urlopen(url)
    zipinmemory = cStringIO.StringIO(remotezip.read())
    zip = zipfile.ZipFile(zipinmemory)
    for fn in zip.namelist():
        # print fn
        if fn == "100":
            data = zip.read(fn)
            print "Read event=100"
except urllib2.HTTPError:
    print "no file"
</code>

In this example, "data" represents a ProMC event record. Look at the example
in examples/python to see how to print such information.
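
For completeness, the raw bytes in "data" can be decoded with the Python classes generated from the embedded proto files (via "promc_proto" and "promc_code"). The module and message names below (ProMC_pb2, ProMCEvent) are assumptions of this sketch; check the generated sources for the actual names:

<code python>
# Assumes the generated Python module is called ProMC_pb2 and the event
# message is called ProMCEvent; verify these names in the generated code.
import ProMC_pb2

event = ProMC_pb2.ProMCEvent()
event.ParseFromString(data)   # "data" is the raw record read in the example above
print event                   # protobuf messages print as a readable text dump
</code>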

==== ProMC File Browser ====

You can look at events and other information stored in ProMC files using a browser implemented in Java.
It runs on Linux/Windows/Mac without any external libraries. First, get the browser:

<code bash>
wget  http://atlaswww.hep.anl.gov/asc/promc/download/browser_promc.jar
</code>

And run it as follows (it assumes Java 7 or above; check with "java -version", which should show a 1.7.X version):

<code bash>
java -jar browser_promc.jar
</code>

Now we can open a ProMC file. Let's get an example ProMC file which keeps 1,000 events generated by Pythia8:

<code bash>
wget  http://atlaswww.hep.anl.gov/asc/promc/download/Pythia8.promc
</code>
Open this file in the browser via [File]->[Open file], or open it from the prompt:

<code bash>
java -jar browser_promc.jar Pythia8.promc
</code>

This opens the file and shows the metadata (i.e. information stored in the header and statistics records):

{{:asc:promc:screenshot_from_2013-05-11_21_43_06.png}}

On the left, you will see the event numbers. Double click on any number. The browser will display the event record with all stored particles for this event (PID, Status, Px, Py, Pz, etc.).

{{:asc:promc:screenshot_from_2013-05-11_21_43_55.png}}


You can access metadata on particle data, such as information on particle types, PID and masses, using the [Metadata]->[Particle data] menu. This record is common to all events (ProMC does not store particle names and masses for each event).

{{:asc:promc:screenshot_from_2013-05-11_21_43_34.png}}


If the ProMC file was made "self-describing" and stores the proto layout templates used to generate analysis
code, you can open the "Data layout" menu:

{{:asc:promc:screenshot_from_2013-05-11_21_44_10.png}}

This information can be used to generate analysis code and make future modifications to the existing file.
Use the "promc_proto" command to extract such files, and "promc_code" to generate the analysis code.
See the tutorial section.

You can look at event information (process ID, PDF, alphaS, weight) if you navigate with the mouse to the event number on the left and click the right button. You will see a pop-up menu. Select "Event information".

==== Visualizing data ====

You can plot histograms of the desired variables using several approaches:

  * Use [[http://root.cern.ch/ | ROOT ]] classes and methods if you read data in C++
  * Use [[http://root.cern.ch/ | ROOT ]] classes and methods if you read data in Python (use PyROOT). You can also use any other visualization framework in Python, such as [[http://matplotlib.org/|MatPlotLib]] (a minimal example is shown after this list)
  * Use [[http://jwork.org/scavis | ScaVis]] classes and methods if you read data using Java
  * Use [[http://jwork.org/scavis | ScaVis]] classes and methods if you read data using [[http://www.jython.org/| Jython]] (Python implemented in Java)
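
For instance, here is a minimal MatPlotLib sketch; it assumes the transverse momenta have already been read from a ProMC file into a plain Python list called pt_values (the numbers below are placeholders):

<code python>
# Plot a pT spectrum with MatPlotLib. pt_values stands for the transverse
# momenta (in GeV) already extracted from a ProMC file; these are placeholders.
import matplotlib.pyplot as plt

pt_values = [0.3, 0.7, 1.2, 2.5, 5.0, 12.0, 45.0]

plt.hist(pt_values, bins=20, range=(0, 50))
plt.xlabel("pT [GeV]")
plt.ylabel("Entries")
plt.savefig("pt_spectrum.png")
</code>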


==== Where is ProMC appropriate? ====

ProMC is used to store truth MC event records (about a factor of 2 more compact than compressed HepMC files).
ProMC was also used for Snowmass 2012-2013 to keep Delphes fast-simulation files (including reconstructed jets and other objects).
See the [[snowmass2013:analyse_d36_promc| Snowmass web page]].
However, it is still in a prototype stage.

==== How to cite this work ====

The ProMC paper is in preparation. It will also be discussed in the Snowmass white paper. For now, cite this work as:

<code>
S.Chekanov, "Next generation input-output file format for HEP data based on Google's protocol buffers" (2013)
https://atlaswww.hep.anl.gov/asc/promc
</code>

==== History ====

ProMC is a rewrite of an older package (CBook)
for the community-supported [[http://jwork.org/jhepwork/|jHepWork]].
Currently, this program has the name [[http://jwork.org/scavis/|SCaVis]]. The current ProMC version is based on the HEvent record format [[http://jwork.org/scavis/examplesHEP/|Examples]] and the zipios++ library, which has been publicly available since 2008. (S.C.)

==== License ====

ProMC is licensed under the GNU General Public License v3 or later. Please refer to [[http://www.gnu.org/licenses/gpl-3.0.html|GPL-v3.0]].

<code>
   This program is free software; you can redistribute it and/or modify it
   under the terms of the GNU General Public License as published by the
   Free Software Foundation; either version 3 of the License, or any later
   version. This program is distributed in the hope that it will be
   useful, but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
   General Public License for more details.
</code>
The text of this manual cannot be freely redistributed and is subject to the Creative Commons Attribution-Share Alike License, either version 3.0 of the License or any later version. See [[http://creativecommons.org/licenses/by-sa/3.0/| By-SA]]. You are free to copy, distribute, transmit, and adapt ProMC under the following conditions:

  * Attribution. You must attribute the ProMC work, e.g. by linking to this web page.
  * Non-commercial. You may not use this work for commercial purposes.
  * Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license.


==== Ongoing work ====

  * More benchmarks. Check compression for different pT scenarios (memory usage, speed).
  * Write interfaces to PYTHIA and HERWIG (the PYTHIA interface is almost done)
  * Better buffering?
  * Write an interface for ROOT
  * Write an interface to the Delphes fast simulation
  * Finish the Java and Python interfaces (Java is almost done)
  * Programs to extract given records and merge records (easy part)
  * Documentation
  * Optimization

  