Feature engineering

From HandWiki
Short description: Extracting features from raw data for machine learning

Feature engineering, also called feature extraction or feature discovery, is the process of extracting features (characteristics, properties, attributes) from raw data to support training a downstream statistical model.[1]

Examples of engineered features in physics include dimensionless numbers, such as the Reynolds number in fluid dynamics, the Nusselt number in heat transfer, and the Archimedes number in sedimentation, as well as first approximations of the solution, such as analytical strength-of-materials solutions in mechanics.[2]
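
As a minimal sketch of the dimensionless-number idea, the following turns four raw measurements into a single Reynolds-number feature using the standard definition Re = ρvL/μ. The row layout, the field names and the sample values are hypothetical and chosen only for illustration.

```python
def reynolds_number(density, velocity, length, viscosity):
    """Return Re = density * velocity * length / viscosity (dimensionless).

    density in kg/m^3, velocity in m/s, length in m,
    viscosity (dynamic) in Pa*s.
    """
    if viscosity <= 0:
        raise ValueError("dynamic viscosity must be positive")
    return density * velocity * length / viscosity


# Hypothetical raw measurements for a water-like fluid.
raw_rows = [
    {"density": 1000.0, "velocity": 0.5, "length": 0.1, "viscosity": 1.0e-3},
    {"density": 1000.0, "velocity": 0.01, "length": 0.02, "viscosity": 1.0e-3},
]

# Four raw columns collapse into one physically meaningful feature.
features = [{"reynolds": reynolds_number(**row)} for row in raw_rows]
```

The first row gives a Reynolds number of about 50,000 and the second about 200, so one engineered column separates the two flow regimes that the four raw columns only describe indirectly.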

Relevance

Features vary in significance.[3] Even relatively insignificant features may contribute to a model. Feature selection can reduce the number of features to prevent a model from becoming too specific to the training data set (overfitting).[4]
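
One simple way to perform feature selection is a filter that scores each feature against the target and keeps the best k. The sketch below uses the absolute Pearson correlation as the score; this is only one of many possible criteria, and the function names and toy data are assumptions made for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (sx * sy)


def select_top_k(columns, target, k):
    """Keep the k feature names with the largest |correlation| to target."""
    scores = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data: one informative, one noisy, one constant feature.
columns = {
    "signal": [1.0, 2.0, 3.0, 4.0, 5.0],
    "noise": [2.0, -1.0, 0.5, -0.5, 1.0],
    "constant": [7.0, 7.0, 7.0, 7.0, 7.0],
}
target = [1.1, 1.9, 3.2, 3.9, 5.1]

selected = select_top_k(columns, target, k=1)
```

A univariate filter like this ignores interactions between features, so a feature that is only useful in combination with another can be discarded by mistake.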

Explosion

Feature explosion occurs when the number of identified features is too large for effective model estimation or optimization. Common causes include:

  • Feature templates - implementing feature templates instead of coding new features
  • Feature combinations - combinations that cannot be represented by a linear system

Feature explosion can be limited with techniques such as regularization, kernel methods, and feature selection.[5]
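
To give a sense of scale, the sketch below enumerates polynomial feature combinations and counts how quickly they grow. It assumes that "feature combinations" means products of input features up to a chosen total degree, which is one common reading rather than something the text specifies; the function names are illustrative.

```python
from itertools import combinations_with_replacement
from math import comb


def polynomial_terms(names, degree):
    """List all products of features with total degree 1..degree."""
    terms = []
    for d in range(1, degree + 1):
        terms.extend(combinations_with_replacement(names, d))
    return terms


def count_terms(n_features, degree):
    """Closed-form count of the same terms: C(n + d, d) - 1."""
    return comb(n_features + degree, degree) - 1


terms = polynomial_terms(["a", "b", "c"], 2)
# 3 linear terms plus 6 quadratic terms.
small = len(terms)

# 100 raw features with combinations up to degree 3.
large = count_terms(100, 3)
```

Three raw features already yield 9 terms at degree 2, and 100 raw features yield 176,850 terms at degree 3, which is why the generated set usually has to be pruned or penalized.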

Automation

Automation of feature engineering is a research topic that dates back to the 1990s.[6] Machine learning software that incorporates automated feature engineering has been commercially available since 2016.[7] Related academic literature can be roughly separated into two types:

  • Multi-relational decision tree learning (MRDTL) uses a supervised algorithm that is similar to a decision tree.
  • Deep Feature Synthesis uses simpler methods.[citation needed]

Multi-relational decision tree learning (MRDTL)

MRDTL generates features in the form of SQL queries by successively adding clauses to the queries.[citation needed] For instance, the algorithm might start out with

SELECT COUNT(*) FROM ATOM t1 LEFT JOIN MOLECULE t2 ON t1.mol_id = t2.mol_id GROUP BY t1.mol_id

The query can then successively be refined by adding conditions, such as "WHERE t1.charge <= -0.392".[citation needed]
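
The refinement step can be tried out with Python's built-in sqlite3 module. The sketch below runs the base query and the refined query on a tiny in-memory database; the table contents are hypothetical, and t1.mol_id and an ORDER BY clause are added to the query only to make the output readable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE MOLECULE (mol_id INTEGER PRIMARY KEY);
    CREATE TABLE ATOM (atom_id INTEGER PRIMARY KEY, mol_id INTEGER, charge REAL);
    INSERT INTO MOLECULE VALUES (1), (2);
    INSERT INTO ATOM VALUES
        (1, 1, -0.5), (2, 1, 0.1), (3, 1, -0.4),
        (4, 2, 0.2), (5, 2, -0.45);
    """
)

# Query template; {where} is the slot in which a refinement adds a clause.
QUERY = """
    SELECT t1.mol_id, COUNT(*)
    FROM ATOM t1 LEFT JOIN MOLECULE t2 ON t1.mol_id = t2.mol_id
    {where}
    GROUP BY t1.mol_id
    ORDER BY t1.mol_id
"""

# Candidate feature 1: number of atoms per molecule.
base_counts = conn.execute(QUERY.format(where="")).fetchall()

# Candidate feature 2: number of atoms per molecule with charge <= -0.392.
refined_counts = conn.execute(
    QUERY.format(where="WHERE t1.charge <= -0.392")
).fetchall()
```

Each query result is one candidate feature per molecule; a supervised search would score such candidates against the target and continue refining the most promising ones.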

However, most MRDTL studies base their implementations on relational databases, which results in many redundant operations. These redundancies can be reduced by using techniques such as tuple id propagation.[8][9] Efficiency can be increased further by using incremental updates, which eliminate redundancies.[10][promotional source?]

Open-source implementations

There are a number of open-source libraries and tools that automate feature engineering on relational data and time series:

  • featuretools is a Python library for transforming time series and relational data into feature matrices for machine learning.[11][12][13]
  • OneBM or One-Button Machine combines feature transformations on relational data with feature selection techniques.[14]
  • getML community is an open source tool for automated feature engineering on time series and relational data.[16][17] It is implemented in C/C++ with a Python interface.[18] According to benchmarks published in its own repository, it is at least 60 times faster than tsflex, tsfresh, tsfel, featuretools or kats.[19]
  • tsfresh is a Python library for feature extraction on time series data.[20] It evaluates the quality of the features using hypothesis testing.[21]
  • tsflex is an open source Python library for extracting features from time series data.[22] Although it is written entirely in Python, it has been shown to be faster and more memory efficient than tsfresh, seglearn or tsfel.[23]
  • seglearn is an extension for multivariate, sequential time series data to the scikit-learn Python library.[24]
  • tsfel is a Python package for feature extraction on time series data.[25]
  • kats is a Python toolkit for analyzing time series data.[26]
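
The kind of work these time series tools automate can be sketched with the standard library alone. The example below computes summary statistics over fixed-size windows and returns one feature row per window. It does not use or imitate the API of any library listed above; the function name, the parameters and the choice of statistics are assumptions.

```python
from statistics import mean, pstdev


def window_features(series, window, stride):
    """Summarize each window of `series` as one row of features."""
    rows = []
    for start in range(0, len(series) - window + 1, stride):
        chunk = series[start:start + window]
        rows.append(
            {
                "start": start,
                "mean": mean(chunk),
                "std": pstdev(chunk),  # population standard deviation
                "min": min(chunk),
                "max": max(chunk),
            }
        )
    return rows


# Hypothetical signal split into two non-overlapping windows.
signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
matrix = window_features(signal, window=3, stride=3)
```

Setting stride smaller than window produces overlapping windows, which yields more rows at the cost of correlated features.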

Deep feature synthesis

The deep feature synthesis (DFS) algorithm beat 615 of 906 human teams in a competition.[27][28]
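
The cited paper describes building features by stacking aggregations along the relationships between tables. The sketch below is a loose illustration of that stacking idea on hypothetical customer, order and item data; it is not the published algorithm and not the featuretools API, and all table and function names are assumptions.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical relational data: customers -> orders -> items.
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 1},
    {"order_id": 12, "customer_id": 2},
]
items = [
    {"order_id": 10, "price": 5.0},
    {"order_id": 10, "price": 15.0},
    {"order_id": 11, "price": 40.0},
    {"order_id": 12, "price": 8.0},
]


def aggregate(children, key, value, func):
    """Group child rows by `key` and reduce the `value` column with `func`."""
    groups = defaultdict(list)
    for row in children:
        groups[row[key]].append(row[value])
    return {k: func(v) for k, v in groups.items()}


# Depth 1: SUM(items.price) for each order.
order_total = aggregate(items, "order_id", "price", sum)

# Depth 2: MEAN over each customer's orders of the depth-1 feature.
orders_with_total = [
    dict(order, total=order_total.get(order["order_id"], 0.0)) for order in orders
]
customer_mean_total = aggregate(orders_with_total, "customer_id", "total", mean)
```

Here customer 1 ends up with a mean order total of 30.0 and customer 2 with 8.0. Repeating the step with other aggregation functions and deeper relationship paths is what makes the number of candidate features grow quickly.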

Feature stores

A feature store is where features are stored and organized for the explicit purpose of either training models (by data scientists) or making predictions (by applications that have a trained model). It is a central location for creating or updating groups of features drawn from multiple data sources, and for creating or updating datasets built from those feature groups. These datasets serve model training as well as applications that retrieve precomputed features at prediction time instead of computing them themselves.[29]

A feature store includes the ability to store code used to generate features, apply the code to raw data, and serve those features to models upon request. Useful capabilities include feature versioning and policies governing the circumstances under which features can be used.[30]
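
The three capabilities named above (storing feature code, applying it to raw data, and serving the results) together with versioning can be sketched as a small in-memory class. This is a toy model under stated assumptions, not the design of any particular feature store product; the class, method and feature names are hypothetical.

```python
class FeatureStore:
    """Toy in-memory feature store keyed by feature name and version."""

    def __init__(self):
        self._definitions = {}  # (name, version) -> function of a raw row
        self._values = {}       # (name, version, entity_id) -> computed value

    def register(self, name, version, func):
        """Store the code that generates a feature; versions are immutable."""
        key = (name, version)
        if key in self._definitions:
            raise ValueError(f"{name} v{version} is already registered")
        self._definitions[key] = func

    def materialize(self, name, version, raw_rows):
        """Apply the stored code to raw data and keep the results."""
        func = self._definitions[(name, version)]
        for entity_id, row in raw_rows.items():
            self._values[(name, version, entity_id)] = func(row)

    def get(self, name, version, entity_id):
        """Serve a precomputed feature value on request."""
        return self._values[(name, version, entity_id)]


# Hypothetical raw data keyed by user id.
raw = {"u1": {"purchases": [10.0, 30.0]}, "u2": {"purchases": []}}

store = FeatureStore()
store.register(
    "avg_purchase", 1,
    lambda row: sum(row["purchases"]) / len(row["purchases"]) if row["purchases"] else 0.0,
)
store.register("avg_purchase", 2, lambda row: float(len(row["purchases"])))
store.materialize("avg_purchase", 1, raw)
store.materialize("avg_purchase", 2, raw)

served = store.get("avg_purchase", 1, "u1")
```

Keeping the version in every key lets a model trained on version 1 keep reading version 1 values after version 2 is introduced, which is the main reason versioning is listed as a useful capability.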

Feature stores can be standalone software tools or built into machine learning platforms.

Alternatives

Feature engineering can be a time-consuming and error-prone process, as it requires domain expertise and often involves trial and error.[31][32] Deep learning algorithms may be used to process a large raw dataset without having to resort to feature engineering.[33] However, deep learning algorithms still require careful preprocessing and cleaning of the input data.[34] In addition, choosing the right architecture, hyperparameters, and optimization algorithm for a deep neural network can be a challenging and iterative process.[35]

References

  1. Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer. ISBN 978-0-387-84884-6. https://books.google.com/books?id=eBSgoAEACAAJ. 
  2. Solid-Liquid Mixing in Stirred Tanks: Modeling, Validation, Design Optimization and Suspension Quality Prediction (Report). 2021. doi:10.13140/RG.2.2.11074.84164/1. https://www.researchgate.net/publication/353947052. 
  3. "Feature Engineering". 2010-04-22. http://www.cs.princeton.edu/courses/archive/spring10/cos424/slides/18-feat.pdf. 
  4. "Feature engineering and selection". Alexandre Bouchard-Côté. October 1, 2009. http://www.cs.berkeley.edu/~jordan/courses/294-fall09/lectures/feature/slides.pdf. 
  5. "Feature engineering in Machine Learning". Zdenek Zabokrtsky. https://ufal.mff.cuni.cz/~zabokrtsky/courses/npfl104/html/feature_engineering.pdf. 
  6. "Multi-relational Decision Tree Induction". Principles of Data Mining and Knowledge Discovery. Lecture Notes in Computer Science. 1704. 1999. pp. 378–383. doi:10.1007/978-3-540-48247-5_46. ISBN 978-3-540-66490-1. https://link.springer.com/content/pdf/10.1007/978-3-540-48247-5_46.pdf. 
  7. "Its all about the features". September 2017. https://reality.ai/it-is-all-about-the-features/. 
  8. "CrossMine: Efficient classification across multiple database relations". Proceedings. 20th International Conference on Data Engineering. 2004. pp. 399–410. doi:10.1109/ICDE.2004.1320014. ISBN 0-7695-2065-0. 
  9. "A Method for Multi-relational Classification Using Single and Multi-feature Aggregation Functions". Knowledge Discovery in Databases: PKDD 2007. Lecture Notes in Computer Science. 4702. 2007. pp. 430–437. doi:10.1007/978-3-540-74976-9_43. ISBN 978-3-540-74975-2. 
  10. "How automated feature engineering works - The most efficient feature engineering solution for relational data and time series". https://get.ml/resources/how-getml-works. 
  11. "What is Featuretools?". https://featuretools.alteryx.com/en/stable/. 
  12. "Featuretools - An open source python framework for automated feature engineering". https://www.featuretools.com. 
  13. "github: alteryx/featuretools". https://github.com/alteryx/featuretools. 
  14. Thanh Lam, Hoang; Thiebaut, Johann-Michael; Sinn, Mathieu; Chen, Bei; Mai, Tiep; Alkan, Oznur (2017-06-01). "One button machine for automating feature engineering in relational databases". arXiv:1706.00327 [cs.DB].
  15. Thanh Lam, Hoang; Thiebaut, Johann-Michael; Sinn, Mathieu; Chen, Bei; Mai, Tiep; Alkan, Oznur (2017-06-01). "One button machine for automating feature engineering in relational databases". arXiv:1706.00327 [cs.DB].
  16. "getML documentation". https://docs.getml.com/latest/. 
  17. "github: getml/getml-community". https://github.com/getml/getml-community. 
  18. "github: getml/getml-community". https://github.com/getml/getml-community. 
  19. "github: getml/getml-community". https://github.com/getml/getml-community. 
  20. "tsfresh documentation". https://tsfresh.readthedocs.io/en/latest. 
  21. "Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests (tsfresh – A Python package)". https://www.researchgate.net/publication/324948288. 
  22. "predict-idlab/tsflex". https://github.com/predict-idlab/tsflex. 
  23. Van Der Donckt, Jonas; Van Der Donckt, Jeroen; Deprost, Emiel; Van Hoecke, Sofie (2022). "tsflex: Flexible time series processing & feature extraction". SoftwareX 17: 100971. doi:10.1016/j.softx.2021.100971. Bibcode: 2022SoftX..1700971V. https://www.sciencedirect.com/science/article/pii/S2352711021001904. Retrieved September 7, 2022. 
  24. "seglearn user guide". https://dmbee.github.io/seglearn/user_guide.html. 
  25. "Welcome to TSFEL documentation!". https://tsfel.readthedocs.io/en/latest/. 
  26. "github: facebookresearch/Kats". https://github.com/facebookresearch/Kats. 
  27. "Automating big-data analysis". 16 October 2015. https://news.mit.edu/2015/automating-big-data-analysis-1016. 
  28. Kanter, James Max; Veeramachaneni, Kalyan (2015). "Deep feature synthesis: Towards automating data science endeavors". 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA). pp. 1–10. doi:10.1109/DSAA.2015.7344858. ISBN 978-1-4673-8272-4. 
  29. "What is a feature store". https://www.featurestore.org/what-is-a-feature-store. 
  30. "An Introduction to Feature Stores". https://phaseai.com/resources/intro-to-feature-stores. 
  31. "Feature Engineering in Machine Learning". https://www.section.io/engineering-education/feature-engineering-in-machine-learning/. 
  32. explorium_admin (2021-10-25). "5 Reasons Why Feature Engineering is Challenging". https://www.explorium.ai/blog/5-reasons-why-feature-engineering-is-challenging/. 
  33. Spiegelhalter, D. J. (2019). The art of statistics : learning from data. [London] UK. ISBN 978-0-241-39863-0. OCLC 1064776283. https://www.worldcat.org/oclc/1064776283. 
  34. "Deep Learning: A Comprehensive Overview on Techniques, Taxonomy, Applications and Research Directions". SN Computer Science 2 (6): 420. November 2021. doi:10.1007/s42979-021-00815-1. PMID 34426802. 
  35. Bengio, Yoshua (2012), "Practical Recommendations for Gradient-Based Training of Deep Architectures", Neural Networks: Tricks of the Trade, Lecture Notes in Computer Science, 7700, Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 437–478, doi:10.1007/978-3-642-35289-8_26, ISBN 978-3-642-35288-1, http://dx.doi.org/10.1007/978-3-642-35289-8_26, retrieved 2023-03-21 

Further reading

  • "Feature & Target Engineering". Hands-On Machine Learning with R. Chapman & Hall. 2019. pp. 41–75. ISBN 978-1-138-49568-5. 
  • Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists. O'Reilly. 2018. ISBN 978-1-4919-5324-2. 
  • "Data Engineering and Data Shaping". Practical Data Science with R (2nd ed.). Manning. 2020. pp. 113–160. ISBN 978-1-61729-587-4.