Archive for the ‘Machine Learning’ Category

Using Machine Learning to Track COVID-19 – Washington and Lee University News Office

By Louise Uffelman | July 29, 2020

Working on a real-life project that will introduce students to how algorithms work in applications with crucial outcomes will provide them with the important skills that can transfer to other areas of computer and data science.

~ Moataz Khalifa

As the race for a COVID-19 vaccine continues, Moataz Khalifa, assistant professor and director of Data Education at Washington and Lee University, is involved in an equally promising research project that focuses on a non-invasive, early detection system of the virus.

In March, just as the number of cases was climbing around the world, Khalifa was invited by Wu Feng, Elizabeth & James Turner Fellow, professor of computer science at Virginia Tech and director of its SyNeRGy lab, to join his research lab to develop a deep-learning algorithm to enhance low-radiation CT scans of people's lungs. Feng's current research was already investigating similar applications in CT scans of brain tumors, and he received two National Science Foundation grants totaling $250,000 to expand his project to work on the COVID-19 early detection system.

Currently, the genetic-based RT-PCR tests available to detect COVID-19 rely on swabbing the nasal cavity. With testing kits in short supply, result accuracy of only around 59%, and increasingly long processing times of 10 to 14 days, the hunt was on for a system that was more accurate, more available, and faster, to handle the growing demand for tracking infections.

Feng is excited to have Khalifa, who earned his Ph.D. in physics from Virginia Tech, join the team. "My expertise is in parallel and distributed computing, and there are intersections with respect to data analytics and machine learning that we've got going on in our labs," Feng said. "Moataz has the skill set to look at different mathematical models used to build an algorithm to get inside the box so that we can build the best medical imaging software system possible."

CT scans, which combine X-rays and computers to capture 2-D cross-sectional images of the body, are able to detect the existence of COVID-19 through certain markers in the lungs. "The initial studies from China and Europe on which Feng's team is building its research indicate that detection using CT scans is possible in asymptomatic people and with higher accuracy (upwards of 90%) than any other test available," Khalifa said.

The barrier is getting people to the hospital for testing, but with a portable CT scanner, the testing can travel to where it's needed, whether that be a college campus or an office complex. Portable scanners generally use less radiation and hence produce lower-resolution images. The algorithm Khalifa is working to modify will help enhance the quality of the images, improving the accuracy of detecting the coronavirus in its early stages. Another upside is that hundreds of people can be scanned a day, with results available essentially in real time.
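The article does not describe the team's deep-learning model, but the core idea of learning a mapping from low-quality to high-quality images can be sketched in miniature. The example below is purely illustrative: it uses a synthetic image standing in for a CT slice and "trains" the simplest possible enhancer, a single 3x3 filter fit by least squares, rather than the team's actual algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a CT slice: a smooth intensity blob (illustrative only).
xx, yy = np.meshgrid(np.linspace(-3, 3, 64), np.linspace(-3, 3, 64))
clean = np.exp(-(xx**2 + yy**2) / 4.0)
noisy = clean + rng.normal(0.0, 0.1, clean.shape)   # simulated low-dose noise

# One training example per interior pixel: its 3x3 noisy neighborhood
# as input, the clean pixel value as target.
patches, targets = [], []
for i in range(1, 63):
    for j in range(1, 63):
        patches.append(noisy[i - 1:i + 2, j - 1:j + 2].ravel())
        targets.append(clean[i, j])
A, b = np.array(patches), np.array(targets)

# "Train" the enhancer: fit a single 3x3 filter by least squares.
w, *_ = np.linalg.lstsq(A, b, rcond=None)
denoised = (A @ w).reshape(62, 62)

interior = clean[1:63, 1:63]
mse_noisy = np.mean((noisy[1:63, 1:63] - interior) ** 2)
mse_denoised = np.mean((denoised - interior) ** 2)
print(f"MSE before: {mse_noisy:.4f}, after learned filter: {mse_denoised:.4f}")
```

A deep network replaces the single linear filter with many learned, nonlinear layers, but the training setup, pairs of degraded and reference images, is the same in spirit.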

As the project moves forward, Khalifa will bring W&L undergraduates onto the research team. He noted that artificial intelligence and machine learning are hot areas right now. "Working on a real-life project that will introduce students to how algorithms work in applications with crucial outcomes will provide them with the important skills that can transfer to other areas of computer and data science," he said.

Khalifa added, "W&L's collaboration with Virginia Tech is a strong and mutually beneficial one, bringing us to the front lines of this fight against COVID-19. I cannot say enough about how exciting it is to be a part of it and to bring our university even closer to such a critical contribution during these times."


New Machine Learning Features, Data Integrations, and Upgraded Classification Engines Available in Grooper Version 2.9 – PRNewswire

OKLAHOMA CITY, July 28, 2020 /PRNewswire/ -- Grooper, the leading intelligent document processing and digital data integration platform, announces the release of version 2.9. Included are fourteen new capabilities that enhance machine learning, classification, separation, data integration, and reporting.

New Machine Learning Features
Machine learning is easier and more powerful. The new Rebuild Training features provide tuning and A/B testing using identical training sets and document training decisions.

Integration with Box
Built-in integration with Box.com enables file import and export, metadata mapping, data lookups, and more.

Advanced Document Classification
Tackle complicated document sets with advanced classification strategies. Target documents within or across groups that are lexically dissimilar or similar with high accuracy.

New Document Viewer
Users can choose from multiple document renditions to build better data extractions.

Improved Document Separation
Document separation is now more robust and accurate due to new auto-separation logic. New page extractors separate unstructured documents.

Enhanced Database Export
Define multiple exports on a single export step, within a single database or spanning multiple tables. Multipart database exports are simplified, and SQL Server-generated identity columns are supported.

CMIS Data Lookups
Populate and validate data fields based on queryable metadata located on CMIS objects.

New Data Annotation Option in Data Review
Extracted document data is now displayed at the extraction location on the document. This speeds up human data review and includes multiple configurable properties.

Content Type Filtering
Users can now enable classification, extraction, and review to proceed in stages for larger, more complicated projects.

Compile Stats Feature
The Compile Stats feature provides comprehensive statistics on classification and extraction activities to assist administrators in developing and troubleshooting advanced content models.

To learn more about Grooper, visit http://www.grooper.com.

About Grooper
Grooper was built from the ground up by BIS, a company with 35 years of continuous experience developing and delivering new technology. Grooper is an intelligent document processing and digital data integration solution that empowers organizations to extract meaningful information from paper/electronic documents and other forms of unstructured data.

The platform combines patented and sophisticated image processing, capture technology, machine learning, natural language processing, and optical character recognition to enrich and embed human comprehension into data. By tackling tough challenges that other systems cannot resolve, Grooper has become the foundation for many industry-first solutions in healthcare, financial services, oil and gas, education, and government.

SOURCE Grooper



Top Machine Learning Algorithms, Frameworks, Tools and Products Used by Data Scientists – Customer Think

A recent survey by Kaggle revealed that data professionals use a variety of algorithms, tools, frameworks, and products to extract insights. The top algorithms were linear/logistic regression, decision trees/random forests, and Gradient Boosting Machines. The top frameworks were Scikit-learn and TensorFlow. The top tools for automation were related to model selection and data augmentation. While half of the respondents did not use ML products, the top products used were Google Cloud ML Engine, Azure ML Studio, and Amazon SageMaker.

Machine learning is employed by data scientists to find patterns and predict important outcomes. The application of machine learning reaches across industries (e.g., healthcare, education) and professions (e.g., marketing, content management), and data professionals have many different tools, methods, and products they can use to extract useful insights. Kaggle surveyed nearly 20,000 data professionals in October 2019 (2019 Kaggle Machine Learning and Data Science Survey), revealing the variety of ways they solve their machine learning problems. Today's post is about the machine learning methods and tools data professionals used in 2019.

Figure 1. Top Machine Learning Algorithms Used in 2019. Click image to enlarge.

The survey asked data professionals, "Which of the following machine learning algorithms do you use on a regular basis? Select all that apply." On average, data professionals used 3 (median) machine learning algorithms. The top 10 machine learning algorithms used are shown in Figure 1.

Adoption rates for the top two algorithms were highest for data professionals who self-identified as statisticians and data scientists. For these data pros, adoption rates were around 10 percentage points higher (e.g., ~80% for linear/logistic regression, ~70% for decision trees and random forests).

A recent poll by KDnuggets found results similar to the current study: machine learning methods included regression (56%), decision trees/rules (48%), random forests (45%), and Gradient Boosting Machines (23%).
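The three top algorithm families from the survey are all available in scikit-learn, and a quick comparison on a toy dataset shows how little code each requires. This is a minimal sketch on synthetic data, not a benchmark; the dataset and parameters are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy tabular dataset standing in for a typical applied ML problem.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The survey's three most-used algorithm families.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
for name, acc in scores.items():
    print(f"{name}: accuracy {acc:.2f}")
```

The uniform fit/score interface across all three models is one reason scikit-learn tops the framework rankings below.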

Figure 2. Machine Learning Frameworks Used. Click image to enlarge.

The survey also asked, "Which of the following machine learning frameworks do you use on a regular basis? Select all that apply." On average, data professionals used 2 (median) machine learning frameworks. The top 10 machine learning frameworks used are shown in Figure 2.

Figure 3. Machine Learning Tools Used. Click image to enlarge.

The survey also asked data professionals about the machine learning tools they used. A little over half of the respondents (53%) indicated that they did not use any automated machine learning tools. The most used automated machine learning tools are shown in Figure 3.

Figure 4. Machine Learning Products Used. Click image to enlarge.

The survey also asked data professionals about the machine learning products they used. A little over a third of the respondents (38%) indicated that they did not use any machine learning products. The most used machine learning products are shown in Figure 4.

I conducted a principal components analysis of the various machine learning utilities to identify groupings of these machine learning methods and found a fairly clear 9-component solution.

Azure Machine Learning Studio stood out as the lone product as it did not load on any of the 9 components.

The pattern of results shows that the various machine learning methods tend to be used together. For example, when ML automation tools are used, data professionals tend to use all of them. Similarly, data professionals tend either to use all Google products or to use none of them. Data professionals who employ evolutionary approaches also tend to use generative adversarial networks.

The results of the Kaggle survey of nearly 20,000 data professionals reveal the most popular machine learning algorithms, products, tools, and frameworks.

While machine learning is still a hot and growing field of data science, over a third of the respondents do not use any ML products. The top algorithms used were linear/logistic regression, decision trees/random forests, and Gradient Boosting Machines. The most used machine learning frameworks were Scikit-learn and TensorFlow. The top tools for machine learning automation were related to model selection and data augmentation. The top products used were Google Cloud ML Engine, Azure ML Studio, and Amazon SageMaker.


Putting AI and Machine Learning to Work in Cloud-Based BI and Analytics – AiThority

Machine learning (ML) in the cloud is powering a whole new generation of intelligent and predictive cloud analytics solutions like Azure Databricks and Azure Synapse. The benefits of cloud economics, tooling, and flexibility, along with next-level insights to drive real-time business decisions, are the primary drivers behind the growing trend of on-premises data lake migrations to the cloud.

Cloud analytics services like Synapse are designed to collect and analyze current, actionable data, delivering insights into processes and workflows that can impact business operations. But what if you need those insights immediately, in the hands of employees and experts working simultaneously across the globe, in real time, and always accurate and up to date? IT stakeholders are turning to the cloud for faster, more accurate, and timelier business insights, especially in the face of COVID-19, when companies are looking to operate as economically as possible and millions are forced into remote working locations.

Even before the pandemic, a 2019 survey by TechTarget found that 27% of respondents planned to deploy cloud analytics in 2020. The same study pointed to increased cloud technology as the number-two activity companies are employing to improve employee experience and productivity, and noted that 38% of companies planned to bolster their cloud technology in 2020. In speaking with the experts at AWS and Azure, that number is higher today. Hindsight is also 20/20!

There are multiple reasons that organizations are moving their data lakes and analytics capabilities to the cloud. First among them is cost: the move streamlines a workforce, so even though there are start-up costs involved in the migration process, the long-term cost-benefit analysis plays out in their favor. Companies are also able to run faster and lighter with cloud analytics, with no need to run dedicated client-side applications, and IT teams are freed of the necessity of coordinating upgrades across an entire infrastructure. In our experience across our customer base at WANdisco, and in working with CSPs like Azure and AWS, we have found that, on average, the total cost of ownership to manage a 1PB Hadoop data lake on premises over a three-year period is $2M. Managing that same 1PB in AWS S3 or Azure ADLS Gen 2 storage costs $900,000 over three years.
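The cost figures quoted above work out to a little over half off. A quick back-of-the-envelope check:

```python
# Three-year total cost of ownership figures quoted in the text.
on_prem_tco = 2_000_000   # 1 PB Hadoop data lake, on premises
cloud_tco = 900_000       # same 1 PB in AWS S3 or Azure ADLS Gen 2 storage

savings = on_prem_tco - cloud_tco
pct = savings / on_prem_tco * 100
print(f"Savings: ${savings:,} ({pct:.0f}% lower over three years)")
# → Savings: $1,100,000 (55% lower over three years)
```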

The question is how to migrate that 1PB data lake as rapidly as possible (time to value), with zero downtime, while ensuring the data stays consistent on premises and in the cloud during migration, since business-critical data is always changing. The architects and data teams have two choices.

They can use various flavors of open-source DistCp tools and scripts, which is the manual approach to a data lake migration. Don't be fooled by fancy names from the Hadoop or cloud vendors; it's all DistCp under the covers. What's wrong with this approach? It's an IT project. And like most IT projects, 61% of them either fail or suffer cost and SLA overruns. Here's what you have to do in this scenario:

How long can this take? We have seen teams struggle for months and even years, depending on data volume and business requirements around acceptable application downtime, data availability, and data consistency. We've seen companies put 8-10 people on projects, fail after 6 months, then pay $1M to a systems integrator and fail after another 9 months. OUCH.

There is a better way. And forward-looking companies like AMD, Daimler, and many others have figured it out. How?

By leveraging modern technology to automate data lake migration and replication to the cloud with WANdisco LiveData Cloud Services through its patented Distributed Coordination Engine platform.

This innovation is founded on fundamental IP based around forming consensus in a distributed network. This is an extremely hard problem to solve, and to this day some people believe it cannot be solved. So what is the problem at a high level? If you have a network of nodes distributed across the world, with little to no knowledge of the distance and bandwidth between the nodes, how can you get the nodes to coordinate with each other without worrying about failure scenarios?

The solution is the application of a consensus algorithm, and the gold standard in consensus is an algorithm called Paxos. Our Chief Scientist, Dr. Yeturu Aahlad, an expert in distributed systems, devised the first, and still the only, commercialized version of Paxos. By doing so, he solved a problem that had been puzzling computer scientists for years.
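To give a flavor of what Paxos does, here is a minimal single-decree sketch: proposers run two phases against a majority of acceptors, and once a value is chosen, any later proposal is forced to converge on it. This is textbook Paxos in its simplest form, purely illustrative; WANdisco's Distributed Coordination Engine is a far more sophisticated, production-grade system.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1        # highest proposal number promised
        self.accepted_n = -1      # proposal number of accepted value
        self.accepted_v = None    # accepted value, if any

    def prepare(self, n):
        """Phase 1b: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted_n, self.accepted_v
        return False, None, None

    def accept(self, n, v):
        """Phase 2b: accept unless a higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted_n, self.accepted_v = n, v
            return True
        return False


def propose(acceptors, n, value):
    """Run one Paxos round; return the chosen value, or None on failure."""
    majority = len(acceptors) // 2 + 1

    # Phase 1: gather promises from a majority of acceptors.
    promises = [a.prepare(n) for a in acceptors]
    granted = [(an, av) for ok, an, av in promises if ok]
    if len(granted) < majority:
        return None

    # Safety rule: if any acceptor already accepted a value, propose the
    # one with the highest proposal number instead of our own.
    prior = [(an, av) for an, av in granted if av is not None]
    if prior:
        value = max(prior)[1]

    # Phase 2: ask the acceptors to accept (n, value).
    accepts = sum(a.accept(n, value) for a in acceptors)
    return value if accepts >= majority else None


acceptors = [Acceptor() for _ in range(5)]
first = propose(acceptors, n=1, value="commit-A")
# A later, competing proposal must converge on the already-chosen value.
second = propose(acceptors, n=2, value="commit-B")
print(first, second)  # → commit-A commit-A
```

The safety rule in phase 2 is the heart of the algorithm: it is what guarantees that once a majority has accepted a value, no later round can choose a different one, even across node failures.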

WANdisco's LiveData Cloud Services are based on this core IP, including our products focused on analytical data and the challenge of migrating that data to the cloud while keeping it consistent in multiple locations.

As businesses request data in increasingly decentralized environments, the old mechanisms for providing and managing data are no longer sufficient. Moreover, the amount of data is rising exponentially, which leads to a phenomenon called data gravity. The greater the volume of data, the greater the challenge of providing it in a distributed environment, allowing changes to it in any environment, and ensuring it remains consistent across all environments. Regulation and compliance requirements make it even more challenging for data managers to fulfill business needs.

As enterprises look to leverage the scale and economics of the cloud, WANdisco offers a fundamentally different approach to manage these large volumes of data accelerating the ability for enterprises to undergo digital transformation.

Here's what Merv Adrian, Research VP of Data and Analytics at Gartner, had to say: "WANdisco's ability to move petabytes of data without interrupting production and without risk of losing the data midflight is something no other vendor does and, until now, has been virtually impossible to accomplish."

The Bottom Line

Cloud computing has completely transformed entire industries, computing paradigms, and enterprises, and has become the ideal for storing and accessing big data sets. The COVID-19 pandemic has only accelerated this move, given the need to operate as economically as possible with more employees working remotely. Cloud computing saves both money and time, which makes it immediately attractive to businesses, while also increasing access for global companies and providing a synergistic platform for coordination and cooperation between far-flung employees. 85% of the Fortune 500 have moved to the cloud and continue to do so. The migration of static data has been easy. The challenge now is how to quickly migrate and replicate large on-premises data lakes and applications to the cloud when the data is business critical and application downtime, data loss, and inconsistencies cannot be tolerated. The good news is that there is now a better way, via automated migration and replication, that delivers 10X faster time to value and is 100% safer, while ensuring zero downtime during migration.



IDTechEx Report Suggests Machine Learning will be Accessible across Chemical and Materials Companies in the Future – CIO Applications

Material Informatics (MI) is a data-centric approach applicable to specific material science and chemistry R&D. Without a doubt, this will become a standard method in a research scientist's toolkit.

FREMONT, CA: Machine learning has rapidly become an essential part of every industry. Material scientists and chemists will all have access to machine learning tools to enhance their Research & Development in the future. Seamlessly integrating these underlying operations will not happen quickly, but overlooking the developments in materials informatics will lead to a loss of competitive advantage.

Rather than just grabbing headlines, some form of MI will be assumed in all developments. The key to MI is the integration, implementation, and manipulation of data infrastructures, as well as machine learning approaches designed for chemical and materials datasets.

There is a significant amount of evidence to support this, but the best indication is how industries are responding to the technology. There has been a large amount of activity over recent years, including partnerships, investments, and announcements from some of the most notable chemical and materials companies.

Machine learning by itself can be used in many kinds of projects: finding new structure-property relationships, proposing new candidates or process conditions, reducing the number of expensive and time-consuming computer simulations, and more. Machine learning approaches can take numerous forms of supervised and unsupervised learning. Generative methods can be effective at screening for optimized outputs across organic compounds, while even simple modified random forest models can be useful for proposing follow-on reactions that meet a desired set of criteria.
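The random-forest screening workflow mentioned above can be sketched in a few lines. The data here is entirely synthetic, a made-up set of composition/process features with a nonlinear target standing in for a measured property, so this illustrates the shape of the workflow rather than any real materials model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for a materials dataset: each row is a candidate
# described by composition/process features; the target is a property
# (e.g., conductivity) with a nonlinear dependence plus measurement noise.
X = rng.random((400, 5))
y = np.sin(3 * X[:, 0]) + X[:, 1] * X[:, 2] + 0.05 * rng.normal(size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Screen unsynthesized candidates and flag the most promising one for lab work.
candidates = rng.random((1000, 5))
best = candidates[np.argmax(model.predict(candidates))]
print(f"test R^2: {model.score(X_te, y_te):.2f}, best candidate: {best.round(2)}")
```

Screening a thousand virtual candidates this way takes milliseconds, which is the whole appeal: the model narrows the search before any expensive synthesis or simulation happens.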

However, this is still at an early stage and requires much more development. There is a lot to be leveraged from existing developments in AI, but it will first require integrating specialist domain knowledge and coping with the unique challenges of materials datasets. The application space is broad, and studies have shown success with materials ranging from organometallics and thermoelectrics to nanomaterials and ceramics.
