OVH Groupe : A journey into the wondrous land of Machine Learning, or Cleaning data is funnier than cleaning my flat! (Part 3) – Marketscreener.com
What am I doing here? The story so far
As you might know if you have been reading our blog for more than a year, I bought a flat in Paris a few years ago. In case you don't know, the real estate market in Paris is expensive, yet it is so tight that a good flat at a fair price can stay on the market for less than a day.
Obviously, you have to make a decision quite fast and, considering the prices, you have to trust that decision. Of course, to trust your decision, you have to take your time, study the market, make some visits, etc. This process can be quite long (in my case, a year passed between the time I decided that I wanted to buy a flat and the time I actually committed to buying my current flat), and even spending a lot of time will never give you a perfect understanding of the market. What if there was a way to do that very quickly and with better accuracy than with the standard process?
As you might also know if you are one of our regular readers, I tried to solve this problem with Machine Learning, using an end-to-end software called Dataiku. In a first blog post, we learned the basics of using Dataiku, and discovered that just knowing how to click on a few buttons wasn't quite enough: you had to bring some sense to your data and to the training algorithm, or you would get absurd results.
In a second entry, we studied the data a bit more, tweaked a few parameters and values in Dataiku's algorithms and trained a new model. This yielded a much better result, and this new model was, if not accurate, at least relevant: the same flat had a higher predicted price when it was bigger or supposedly in a better neighbourhood. However, it was far from perfect and really lacked accuracy for several reasons, some of them out of our control.
However, all of this was done on one instance of Dataiku - a licensed software - on a single VM. There are multiple reasons that could push me to do things differently:
What we did very intuitively (and somewhat naively) with Dataiku was actually a quite complex pipeline that is often called ELT, for Extract, Load and Transform.
And obviously, after this ELT process, we added a step to train a model on the transformed data.
So what are we going to do to redo all of that without Dataiku's help?
When ELT becomes ELTT
Now that we know what we are going to do, let us proceed!
Before beginning, we have to properly set up our environment to be able to launch the different tools and products. Throughout this tutorial, we will show you how to do everything with CLIs. However, all these manipulations can also be done on OVHcloud's manager (GUI), in which case you won't have to configure these tools.
For all the manipulations described in the next phase of this article, we will use a Virtual Machine deployed in OVHcloud's Public Cloud. It will serve both as the extraction agent that downloads the raw data from the web and pushes it to S3, and as the CLI machine from which we launch data processing and notebook jobs. It is a d2-4 flavor with 4 GB of RAM, 2 vCores and 50 GB of local storage running Debian 10, deployed in the Gravelines datacenter. During this tutorial, I run a few UNIX commands, but you should easily be able to adapt them to whatever OS you use if needed. All the CLI tools specific to OVHcloud's products are available on multiple OSs.
You will also need an OVHcloud NIC (user account) as well as a Public Cloud Project created for this account with a quota high enough to deploy a GPU (if that is not the case, you can still deploy a notebook on CPU rather than GPU; the training phase will just take more time). To create a Public Cloud project, you can follow these steps.
Here is a list of the CLI tools and other tools that we will use during this tutorial, and why:
Additionally, you will find commented code samples for the processing and training steps in this GitHub repository.
In this tutorial, we will use several object storage buckets. Since we will use the S3 API, we will call them S3 buckets, but as mentioned above, if you use OVHcloud's standard Public Cloud Storage, you could also use the Swift API. However, you are restricted to the S3 API if you use our new high-performance object storage offer, currently in Beta.
For this tutorial, we are going to create and use the following S3 buckets:
To create these buckets, use the following commands after having configured your aws CLI as explained above:
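As a minimal sketch of this step, the short Python helper below only prints `aws s3 mb` commands (the AWS CLI's "make bucket" subcommand) for a list of bucket names; it does not run them. `transactions-ecoex-clean` and `transactions-ecoex-model` are the names used later in this tutorial; the other two names are placeholders chosen for illustration, so replace them with your own.

```python
import shlex


def make_bucket_commands(bucket_names):
    """Return one `aws s3 mb` command (as an argument list) per bucket name."""
    return [["aws", "s3", "mb", f"s3://{name}"] for name in bucket_names]


BUCKETS = [
    "transactions-ecoex-raw",    # hypothetical name for the raw data bucket
    "transactions-ecoex-code",   # hypothetical name for the job code bucket
    "transactions-ecoex-clean",  # named in this tutorial
    "transactions-ecoex-model",  # named in this tutorial
]

for command in make_bucket_commands(BUCKETS):
    # Printed only: check the names, then paste the lines into your shell.
    print(shlex.join(command))
```

This assumes your aws CLI is already configured for your OVHcloud endpoint, as described earlier.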
Now that you have your environment set up and your S3 buckets ready, we can begin the tutorial!
First, let us download the data files directly from Etalab's website and unzip them:
You should now have the following files in your directory, each one corresponding to the French real estate transactions of a specific year:
Now, use the S3 CLI to push these files to the relevant S3 bucket:
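To illustrate the upload, here is a small sketch that builds one `aws s3 cp` command per data file found in a directory and prints it. Both the bucket name and the `*.txt` file pattern are assumptions made for the example; use the names of your own bucket and files.

```python
import shlex
from pathlib import Path


def upload_commands(directory, bucket, pattern="*.txt"):
    """Build one `aws s3 cp` command per matching file, sorted by file name."""
    files = sorted(Path(directory).glob(pattern))
    return [["aws", "s3", "cp", str(path), f"s3://{bucket}/"] for path in files]


# "transactions-ecoex-raw" and "*.txt" are placeholders, not names from the tutorial.
for command in upload_commands(".", "transactions-ecoex-raw"):
    # Printed only: review the list before running anything.
    print(shlex.join(command))
```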
You should now have those 5 files in your S3 bucket:
What we just did with a small VM was ingest data into an S3 bucket. In real-life use cases with more data, we would probably use dedicated tools to ingest the data. However, in our example, with just a few GB of data coming from a public website, this does the trick.
Now that your raw data is in place and ready to be processed, you just have to upload the code necessary to run your data processing job. Our data processing product allows you to run Spark code written in Java, Scala or Python. In our case, we used PySpark. Your code should consist of 3 files:
Once you have your code files, go to the folder containing them and push them to the appropriate S3 bucket:
Your bucket should now look like this:
You are now ready to launch your data processing job. The following command will allow you to launch this job on 10 executors, each with 4 vCores and 15 GB of RAM.
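To make the sizing concrete, the sketch below turns "10 executors, each with 4 vCores and 15 GB of RAM" into resource flags and totals. The flag names are those of Apache Spark's own `spark-submit`; whether OVHcloud's CLI accepts exactly the same ones is an assumption, so check its help output before relying on them.

```python
def executor_args(num_executors, cores_per_executor, memory_gb_per_executor):
    """Return spark-submit style resource flags and the executor totals."""
    flags = [
        "--num-executors", str(num_executors),
        "--executor-cores", str(cores_per_executor),
        "--executor-memory", f"{memory_gb_per_executor}G",
    ]
    totals = {
        # Totals cover the executors only; the Spark driver is not counted.
        "vcores": num_executors * cores_per_executor,
        "memory_gb": num_executors * memory_gb_per_executor,
    }
    return flags, totals


flags, totals = executor_args(10, 4, 15)
print(" ".join(flags))
print(totals)
```

With these values, the executors add up to 40 vCores and 150 GB of RAM.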
Note that the data processing product uses the Swift API to retrieve the code files. This is totally transparent to the user, and the fact that we used the S3 CLI to create the bucket has absolutely no impact. When the job is over, you should see the following in your transactions-ecoex-clean bucket:
Before going further, let us look at the size of the data before and after cleaning:
As you can see, from ~2.5 GB of raw data, we extracted only ~10 MB of actually useful data (only 0.4%)! What is noteworthy here is that you can easily imagine use cases where you need a large-scale infrastructure to ingest and process the raw data, but where one or a few VMs are enough to work on the clean data. Obviously, this is more often the case when working with text or structured data than with raw sound, images or videos.
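The percentage above is a simple ratio, checked here with the approximate sizes quoted in the text (decimal units are assumed, i.e. 1 GB = 1000 MB):

```python
def kept_fraction(raw_bytes, clean_bytes):
    """Share of the raw data volume that survives the cleaning step."""
    return clean_bytes / raw_bytes


GB = 1000 ** 3
MB = 1000 ** 2

# ~2.5 GB of raw data and ~10 MB of clean data, as quoted above.
fraction = kept_fraction(2.5 * GB, 10 * MB)
print(f"{fraction:.1%} of the raw volume is kept")
```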
Before we start training a model, take a look at these two screenshots from OVHcloud's data processing UI to erase any doubt you have about the power of distributed computing:
In the first picture, you see the time taken for this job when launching only 1 executor: 8:35 minutes. This duration is reduced to only 2:56 minutes when launching the same job (same code, etc.) on 4 executors: almost 3 times faster. And since you pay as you go, this will only cost you ~33% more in that case for the same operation done 3 times faster, without any modification to your code, only one argument in the CLI call. Let us now use this data to train a model.
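The comparison between the two runs can be reproduced with a little arithmetic. The sketch below assumes that the cost is proportional to the number of executors multiplied by the job duration and ignores the Spark driver; actual billing rules may differ, so treat the cost figure as an order of magnitude only.

```python
def to_seconds(mm_ss):
    """Convert a 'minutes:seconds' string such as '8:35' to seconds."""
    minutes, seconds = mm_ss.split(":")
    return int(minutes) * 60 + int(seconds)


def compare_runs(base_time, base_executors, new_time, new_executors):
    """Return (speed-up, cost ratio) of the new run relative to the base run."""
    base_s, new_s = to_seconds(base_time), to_seconds(new_time)
    speedup = base_s / new_s
    # Assumption: cost scales with executors x duration (driver ignored).
    cost_ratio = (new_executors * new_s) / (base_executors * base_s)
    return speedup, cost_ratio


speedup, cost_ratio = compare_runs("8:35", 1, "2:56", 4)
print(f"speed-up: x{speedup:.2f}")
print(f"executor-time cost: {cost_ratio - 1:+.0%}")
```

Under this simplified model, the speed-up is about 2.9 and the executor time consumed rises by roughly 37%, which is in the same range as the ~33% quoted above.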
To train the model, you are going to use OVHcloud AI notebook to deploy a notebook! With the following command, you will:
In our case, we launch a notebook with only 1 GPU because the code samples we provide would not leverage several GPUs for a single job. I could adapt my code to parallelize the training phase on multiple GPUs, in which case I could launch a job with up to 4 parallel GPUs.
Once you're done, just get the URL of your notebook with the following command and connect to it with your browser:
You can now import the real-estate-training.ipynb file to the notebook with just a few clicks. If you don't want to import it from the computer you use to access the notebook (for example if, like me, you use a VM to work and have cloned the git repo on this VM and not on your computer), you can push the .ipynb file to your transactions-ecoex-clean or transactions-ecoex-model bucket and re-synchronize the bucket to your notebook while it runs by using the ovhai notebook pull-data command. You will then find the notebook file in the corresponding directory.
Once you have imported the notebook file to your notebook instance, just open it and follow the instructions. If you are interested in the result but don't want to do it yourself, let's sum up what the notebook does:
Use the models built in this tutorial at your own risk
So, what can we conclude from all of this? First, even if the second model is obviously better than the first, it is still very noisy: while not far from correct on average, there is still a huge variance. Where does this variance come from?
Well, it is not easy to say. To paraphrase the finishing part of my last article:
In this article, I tried to give you a glimpse of the tools that Data Scientists commonly use to manipulate data and train models at scale, in the Cloud or on their own infrastructure:
Hopefully, you now have a better understanding of how Machine Learning algorithms work, what their limitations are, and how Data Scientists work on data to create models.
As explained earlier, all the code used to obtain these results can be found here. Please don't hesitate to replicate what I did or adapt it to other use cases!
Solutions Architect at OVHcloud