Software:Apache Arrow

Short description: Software framework
Apache Arrow
Developer(s): Apache Software Foundation
Initial release: October 10, 2016
Repository: https://github.com/apache/arrow
Written in: C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, Rust
Type: Data format, algorithms
License: Apache License 2.0
Website: arrow.apache.org

Apache Arrow is a language-agnostic software framework for developing data analytics applications that process columnar data. It contains a standardized column-oriented memory format that can represent flat and hierarchical data for efficient analytic operations on modern CPU and GPU hardware.[1][2][3][4][5] The format is intended to reduce or eliminate factors that limit the feasibility of working with large data sets, such as the cost, volatility, or physical constraints of dynamic random-access memory.[6]
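
As a rough illustration of what a column-oriented layout with missing values can look like, the sketch below packs a nullable 32-bit integer column into two contiguous buffers: a validity bitmap (one bit per value, least-significant bit first, 1 meaning "present") and a fixed-width data buffer. It uses only the Python standard library rather than any Arrow library, and the helper names (`build_int32_column`, `is_valid`, `column_sum`) are hypothetical. Real Arrow buffers also follow alignment and padding rules that are omitted here.

```python
import struct


def build_int32_column(values):
    """Pack a list of ints/None into (validity_bitmap, data_buffer, null_count).

    Hypothetical helper for illustration; not part of any Arrow library.
    """
    n = len(values)
    bitmap = bytearray((n + 7) // 8)   # one bit per value, rounded up to whole bytes
    data = bytearray(4 * n)            # one 4-byte slot per value, null or not
    null_count = 0
    for i, v in enumerate(values):
        if v is None:
            null_count += 1            # bit stays 0 and the slot stays zero-filled
        else:
            bitmap[i // 8] |= 1 << (i % 8)          # least-significant bit first
            struct.pack_into("<i", data, 4 * i, v)  # little-endian int32
    return bytes(bitmap), bytes(data), null_count


def is_valid(bitmap, i):
    """Return True if the value at index i is present (its bit is set)."""
    return (bitmap[i // 8] >> (i % 8)) & 1 == 1


def column_sum(bitmap, data, length):
    """Sum the non-null values by scanning the contiguous data buffer."""
    total = 0
    for i in range(length):
        if is_valid(bitmap, i):
            total += struct.unpack_from("<i", data, 4 * i)[0]
    return total


values = [1, None, 3, 4, None, 6, 7, 8, 9]
bitmap, data, null_count = build_int32_column(values)
print(bitmap.hex(), len(data), null_count, column_sum(bitmap, data, len(values)))
```

Because every value occupies a fixed-width slot at a predictable offset, an analytic operation such as the sum above only has to walk one contiguous buffer per column instead of decoding whole rows.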

Interoperability

Arrow can be used with Apache Parquet, Apache Spark, NumPy, PySpark, pandas and other data processing libraries. The project includes native software libraries written in C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python, R, Ruby, and Rust. Arrow allows for zero-copy reads and fast data access and interchange without serialization overhead between these languages and systems.[1]
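
The idea behind zero-copy access can be sketched with Python's built-in buffer protocol. This is only an analogy under stated assumptions, not Arrow's own API: a `memoryview` exposes a window onto an existing buffer without duplicating it, whereas an ordinary slice produces a copy. The variable names are illustrative.

```python
from array import array

# A contiguous buffer of C ints standing in for one column of data.
column = array("i", [10, 20, 30, 40, 50])

# Zero-copy: the memoryview slice refers to the same underlying memory.
view = memoryview(column)
window = view[1:4]

# Copying: an ordinary slice allocates a new array with its own memory.
copied = column[1:4]

# Modify the original buffer after both slices were taken.
column[2] = 99

print(window.tolist())       # reflects the change, since no copy was made
print(copied.tolist())       # still holds the old values
print(window.obj is column)  # the view is backed by the original object
```

A consumer that receives `window` can read the data in place, which is the property that lets systems sharing a common memory format exchange data without a serialization step.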

Applications

Arrow has been used in diverse domains, including analytics,[7] genomics,[8][6] and cloud computing.[9]

Comparison to Apache Parquet and ORC

Apache Parquet and Apache ORC are popular examples of on-disk columnar data formats. Arrow is designed as a complement to these formats for processing data in memory.[10] The hardware resource engineering trade-offs for in-memory processing differ from those associated with on-disk storage.[11] The Arrow and Parquet projects include libraries for reading and writing data between the two formats.[12]

Governance

Apache Arrow was announced by The Apache Software Foundation on February 17, 2016,[13] with development led by a coalition of developers from other open source data analytics projects.[14][15][5][16][17] The initial codebase and Java library were seeded by code from Apache Drill.[13]

References

  1. "Apache Arrow and Distributed Compute with Kubernetes". 13 Dec 2018. https://www.xenonstack.com/insights/what-is-apache-arrow/.
  2. Baer, Tony (17 February 2016). "Apache Arrow: Lining Up The Ducks In A Row... Or Column". https://seekingalpha.com/article/3904056-apache-arrow-lining-up-ducks-in-row-column. 
  3. Baer, Tony (25 February 2019). "Apache Arrow: The little data accelerator that could". https://www.zdnet.com/article/apache-arrow-the-little-data-accelerator-that-could/. 
  4. Hall, Susan (23 February 2016). "Apache Arrow's Columnar Layouts of Data Could Accelerate Hadoop, Spark". https://thenewstack.io/apache-arrow-designed-accelerate-hadoop-spark-columnar-layouts-data/. 
  5. Yegulalp, Serdar (27 February 2016). "Apache Arrow aims to speed access to big data". InfoWorld. https://www.infoworld.com/article/3033446/hadoop/apache-arrow-aims-to-speed-access-to-big-data.html.
  6. Tanveer Ahmad (2019). "ArrowSAM: In-Memory Genomics Data Processing through Apache Arrow Framework". bioRxiv: 741843. doi:10.1101/741843. https://www.biorxiv.org/content/10.1101/741843v1.
  7. Dinsmore T.W. (2016). "In-Memory Analytics: Satisfying the Need for Speed". Disruptive Analytics. Apress, Berkeley, CA. pp. 97–116. doi:10.1007/978-1-4842-1311-7_5. ISBN 978-1-4842-1312-4. 
  8. "Scalable genomics: from raw data to aligned reads on Apache YARN". IEEE International Conference on Big Data: 1232–1241. 2016. https://www.biorxiv.org/content/biorxiv/early/2016/08/23/071092.full.pdf. 
  9. "Return of the runtimes: rethinking the language runtime system for the cloud 3.0 era". Proceedings of the 16th Workshop on Hot Topics in Operating Systems (ACM): 138–143. 2017. doi:10.1145/3102980.3103003. 
  10. Le Dem, Julien. "Apache Arrow and Apache Parquet: Why We Needed Different Projects for Columnar Data, On Disk and In-Memory". KDnuggets. https://www.kdnuggets.com/2017/02/apache-arrow-parquet-columnar-data.html. 
  11. "Apache Arrow vs. Parquet and ORC: Do we really need a third Apache project for columnar data representation?". 2017-10-31. http://dbmsmusings.blogspot.com/2017/10/apache-arrow-vs-parquet-and-orc-do-we.html. 
  12. "PyArrow:Reading and Writing the Apache Parquet Format". https://arrow.apache.org/docs/python/parquet.html. 
  13. "The Apache® Software Foundation Announces Apache Arrow™ as a Top-Level Project". 17 February 2016. https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces87.
  14. Martin, Alexander J. (17 February 2016). "Apache Foundation rushes out Apache Arrow as top-level project". The Register. https://www.theregister.co.uk/2016/02/17/apache_arrow_toplevel_project/. 
  15. "Big data gets a new open-source project, Apache Arrow: It offers performance improvements of more than 100x on analytical workloads, the foundation says.". 2016-02-17. https://www.cio.com/article/3034279/big-data-gets-a-new-open-source-project-apache-arrow.html. 
  16. Le Dem, Julien (28 November 2016). "The first release of Apache Arrow". SD Times. https://sdtimes.com/apache/guest-view-first-release-apache-arrow/. 
  17. "Julien Le Dem on the Future of Column-Oriented Data Processing with Apache Arrow.". https://www.infoq.com/news/2016/12/le-dem-apache-arrow/. 
