Web data integration

From HandWiki

Web data integration (WDI) is the process of aggregating and managing data from different websites into a single, homogeneous workflow. This process includes data access, transformation, mapping, quality assurance and fusion of data. Data that is sourced and structured from websites is referred to as "web data". WDI is an extension and specialization of data integration that views the web as a collection of heterogeneous databases. Data integration techniques applied to the web form the foundation for businesses that take advantage of the data available on the ever-increasing number of publicly accessible websites.[1] Corporate spending in this area amounted to about USD 2.5bn in 2017, and the market was expected to reach almost USD 7bn by 2020.[2]

Sources

Web data integration treats the web as a collection of views of databases that are accessible over web protocols. These sources include, but are not limited to:[3]

  • Open data catalogs
  • Government data catalogs
  • Web applications and sites
    • UI (web scraping)
    • API
  • The semantic web (SPARQL)
  • HTML embedded structured data
  • HTML data tables
  • Spreadsheets
  • PDFs
  • Online encyclopedias
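
As an illustration of one source type from the list above, the following minimal sketch turns an HTML data table into a list of records using only the Python standard library. The class name `TableExtractor`, the helper `table_to_records` and the sample markup are hypothetical and chosen for this example; real pages usually need more handling (nested tables, `colspan`, missing header rows).

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of every <tr> in an HTML document."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # row being built, or None outside <tr>
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html):
    """Treat the first row as the header and return one dict per later row."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical sample markup standing in for a fetched web page.
SAMPLE_HTML = """
<table>
  <tr><th>product</th><th>price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>24.50</td></tr>
</table>
"""

records = table_to_records(SAMPLE_HTML)
```

All values come back as strings; converting them to typed fields belongs to the transformation step.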

Data access and transformation

WDI poses technical challenges that conventional data integration does not. Web data sources are often unstructured or semi-structured and lack a standard query mechanism, so data access and transformation require additional work.
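
The transformation and mapping step can be sketched as follows, assuming two hypothetical sources (`shop_a`, `shop_b`) that describe the same kind of item with different field names and price formats. The field maps and the common schema (`name`, `price`) are assumptions made for this example, not part of any standard.

```python
import json

# Hypothetical per-source field mappings: source field -> common field.
FIELD_MAPS = {
    "shop_a": {"title": "name", "price_usd": "price"},
    "shop_b": {"productName": "name", "cost": "price"},
}


def parse_price(value):
    """Turn '$1,299.00' or 9.99 into a float."""
    if isinstance(value, (int, float)):
        return float(value)
    cleaned = value.replace("$", "").replace(",", "").strip()
    return float(cleaned)


def to_common_schema(source, raw):
    """Rename known fields, drop unknown ones, and normalise the price."""
    record = {"source": source}
    for src_field, common_field in FIELD_MAPS[source].items():
        if src_field in raw:
            record[common_field] = raw[src_field]
    if "price" in record:
        record["price"] = parse_price(record["price"])
    return record


# Hypothetical payloads standing in for responses from two web sources.
payload_a = json.loads('{"title": "Widget", "price_usd": "$1,299.00"}')
payload_b = json.loads('{"productName": "Widget", "cost": 1250.5, "stock": 3}')

unified = [
    to_common_schema("shop_a", payload_a),
    to_common_schema("shop_b", payload_b),
]
```

Keeping the per-source mapping in data rather than in code means a new source can be added by extending `FIELD_MAPS`, which is one possible design among many.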

Data quality

Understanding the quality and veracity of data is even more important in WDI than in data integration, because web data is generally of lower quality and less implicitly trusted than data collected from a trusted source. There have been attempts to automate trust ratings for web data.[4]
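
One way a trust rating might be used, shown here purely as an illustrative sketch, is to resolve conflicting values when records from several sources are fused. The trust scores below are hypothetical constants and are not produced by the method in the cited work; the voting rule is a simple assumption, not an established algorithm.

```python
from collections import defaultdict

# Hypothetical trust scores in [0, 1], invented for this example.
TRUST = {"shop_a": 0.9, "shop_b": 0.6, "forum_post": 0.2}


def fuse(records, key="name", attribute="price"):
    """For each entity, keep the value whose sources have the highest total trust."""
    votes = defaultdict(lambda: defaultdict(float))
    for record in records:
        # Unknown sources contribute no trust.
        votes[record[key]][record[attribute]] += TRUST.get(record["source"], 0.0)
    return {entity: max(values, key=values.get) for entity, values in votes.items()}


observations = [
    {"source": "shop_a", "name": "Widget", "price": 9.99},
    {"source": "shop_b", "name": "Widget", "price": 9.49},
    {"source": "forum_post", "name": "Widget", "price": 9.49},
    {"source": "shop_b", "name": "Gadget", "price": 24.5},
]

fused = fuse(observations)
```

In this sample, two low-trust sources agree on 9.49 for "Widget", but their combined trust is still below that of the single source reporting 9.99, so 9.99 is kept.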

In data integration, data quality checks can generally be performed after data access and transformation. In WDI, quality may need to be monitored while the data is being collected, because re-collecting the data costs both time and money.[5]
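
Monitoring quality during collection could look like the sketch below, which validates each record as it arrives and aborts the run once the failure rate passes a threshold. The validation rules, the threshold values and the function names are assumptions for illustration only.

```python
def validate(record):
    """Return a list of problems found in one collected record."""
    problems = []
    if not record.get("name"):
        problems.append("missing name")
    price = record.get("price")
    if not isinstance(price, (int, float)) or price <= 0:
        problems.append("invalid price")
    return problems


def collect_with_monitoring(records, max_failure_rate=0.5, min_seen=4):
    """Split records into accepted and rejected, stopping early on poor quality."""
    accepted, rejected = [], []
    for seen, record in enumerate(records, start=1):
        problems = validate(record)
        if problems:
            rejected.append((record, problems))
        else:
            accepted.append(record)
        # Abort once enough records were seen and too many of them failed.
        if seen >= min_seen and len(rejected) / seen > max_failure_rate:
            raise RuntimeError(
                f"aborting: {len(rejected)} of {seen} records failed validation"
            )
    return accepted, rejected


# Hypothetical records standing in for items arriving from a crawl.
sample = [
    {"name": "Widget", "price": 9.99},
    {"name": "", "price": 5.0},
    {"name": "Gadget", "price": -1},
    {"name": "Gizmo", "price": 24.5},
]

accepted, rejected = collect_with_monitoring(sample)
```

Stopping early lets an operator fix an extractor that has broken (for example after a site redesign) before a full, costly collection run completes with unusable data.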

Applications

WDI has applications in many fields, including bioinformatics,[6] search engines,[7] price comparison,[8] forensic search,[9] data analysis, business intelligence, e-commerce,[10] healthcare, pharmaceuticals[11] and product development.

Most price comparison engines and recommendation systems use user-generated data to create recommendations for their users. Similarly, healthcare systems use the results of competitions run on websites such as Kaggle[12] to assess the accuracy of data and to create user-focused products. According to an IBM estimate cited by Import.io, poor-quality data costs companies over $3 trillion[13] each year.

References

  1. "IE 670 Web Data Integration" (in en). 2019-01-24. https://www.uni-mannheim.de/dws/teaching/course-details/courses-for-master-candidates/ie-670-web-data-integration/. 
  2. "Opimas: The Web Data Extraction Market" (in en). http://www.opimas.com/research/355/detail/. 
  3. "Introduction :: Web Data Integration". https://www.webdataintegration.io/wdi/1.0/index.html. 
  4. Giménez-García, José M.; Thakkar, Harsh; Zimmermann, Antoine (2016). "Assessing Trust with PageRank in the Web of Data". in Sack, Harald; Rizzo, Giuseppe; Steinmetz, Nadine et al. (in en). The Semantic Web. Lecture Notes in Computer Science. 9989. Springer International Publishing. pp. 293–307. doi:10.1007/978-3-319-47602-5_45. ISBN 9783319476025. https://hal-emse.ccsd.cnrs.fr/emse-01310508/file/JMGFG2016.pdf. 
  5. The way to integrate data from E-commerce website platform
  6. "Web Data Integration". https://dbs.uni-leipzig.de/en/projekte/DATAINT/index.html. 
  7. "Web-scale Data Integration - You Can Only Afford to Pay as You Go". http://www.datascienceassn.org/content/web-scale-data-integration-you-can-only-afford-pay-you-go. 
  8. Siegel, Michael D.; Madnick, Stuart E.; Zhu, Hongwei (2008). "Enabling global price comparison through semantic integration of web data" (in en). International Journal of Electronic Business 6 (4): 319. doi:10.1504/IJEB.2008.020672. 
  9. "PwC buys Kusiri, London-based fraud detection start-up" (in en). 2015-10-30. https://www.consultancy.uk/news/2840/pwc-buys-kusiri-londonbased-fraud-detection-startup. 
  10. Osial, P.; Kauranen, K.; Ahmed, E. (April 2017). "Smartphone recommendation system using web data integration techniques". 2017 IEEE 30th Canadian Conference on Electrical and Computer Engineering (CCECE). pp. 1–5. doi:10.1109/CCECE.2017.7946845. ISBN 978-1-5090-5538-8. https://ieeexplore.ieee.org/document/7946845/. 
  11. "How Data Integration is Revamping Healthcare and Pharma" (in en-US). 2020-04-27. https://dataintegrationinfo.com/data-integration-in-healthcare-pharma/. 
  12. "Kaggle: Your Machine Learning and Data Science Community" (in en). https://www.kaggle.com/. 
  13. Import.io. "Web Data Integration: Revolutionizing the Way You Work with Web Data" (in en-US). https://www.import.io/post/web-data-integration-revolutionizing-the-way-you-work-with-web-data/.