Data Science Weekly Newsletter - Issue 354

Issue #354

Sep 03 2020

Editor Picks
  • US Presidential Voices Over the Ages
    Using data science techniques and tools for Natural Language Processing and Unsupervised Learning, I set out to better understand Presidents’ use of their speeches’ power by examining the sentiment, sophistication of speech, and focus of content for over 990 presidential speeches...

A Message from this week's Sponsor:


Data scientists are in demand on Vettery

Vettery is an online hiring marketplace that's changing the way people hire and get hired. Ready for a bold career move? Make a free profile, name your salary, and connect with hiring managers from top employers today.

Data Science Articles & Videos

  • The new contender to GANs: score matching with Langevin Sampling
    Last year, Yang Song, a graduate student at Stanford, showed an entirely new way of generating data based on denoising score matching with Annealed Langevin sampling (DSM-ALS). The paper showed that a non-adversarial approach could reach levels similar to GANs (FID of 25, which is what Relativistic GANs reached two years ago)...In this article, I will explain the basics of DSM-ALS as simply as possible. I will then talk about the good and the bad of this approach...
  • Semantic Pyramid for Image Generation
    We present a novel GAN-based model that utilizes the space of deep features learned by a pre-trained classification model. Inspired by classical image pyramid representations, we construct our model as a Semantic Generation Pyramid - a hierarchical framework which leverages the continuum of semantic information encapsulated in such deep features; this ranges from low level information contained in fine features to high level, semantic information contained in deeper features...
  • Introducing the Apple AI/ML residency program
    The year-long program will welcome residents with STEM graduate degrees or equivalent industry experience, software development backgrounds, and niche expertise, like design, linguistics, neuroscience, or psychology. The program aims to invest in the resident’s technical and theoretical machine learning development, and help advance their professional careers... The residents will have the opportunity to attend personalized machine learning and AI courses, learn from an Apple mentor closely involved in their program, collaborate with fellow multi-talented residents, and gain hands-on experience working on high-impact projects with our machine learning teams...
  • learning@home: Hivemind - train large neural networks across the internet
    Hivemind is a library for decentralized training of large neural networks. In a nutshell, you want to train a neural network, but all you have is a bunch of enthusiasts with unreliable computers that communicate over the internet. Any peer may fail or leave at any time, but the training must go on. To meet this objective, hivemind models use a specialized layer type: the Decentralized Mixture of Experts (DMoE). Here's how it works...
  • Domain-specific language model pretraining for biomedical natural language processing
    In this blog post, we present our recent advances in pretraining neural language models for biomedical NLP. We question the prevailing assumption that pretraining on general-domain text is necessary and useful for specialized domains such as biomedicine. Instead, we show that biomedical text is very different from newswires and web text. By pretraining solely on biomedical text from scratch, our PubMedBERT model outperforms all prior language models and obtains new state-of-the-art results in a wide range of biomedical applications...
  • A round-up of topology-based papers at ICML 2020
    With this year’s International Conference on Machine Learning (ICML) being over, it is time to have another instalment of this series. Similar to last year’s post, I shall cover several papers that caught my attention because of their use of topological concepts—however, unlike last year, I shall not restrict the selection to papers using topological data analysis (TDA)...
  • European Conference on Computer Vision 2020: Some Highlights
    The 2020 European Conference on Computer Vision (ECCV) took place online from 23 to 28 August and consisted of 1360 papers, divided into 104 orals, 160 spotlights, and the remaining 1096 papers as posters, in addition to 45 workshops and 16 tutorials. As is the case in recent years with ML and CV conferences, the huge number of papers can be overwhelming at times. Similar to my CVPR2020 post, to get a grasp of the general trends of the conference this year, I will present in this blog post a snapshot of the conference by summarizing some papers (and listing some others) that grabbed my attention...
  • Opacus: A high-speed library for training PyTorch models with differential privacy
    A new high-speed library for training PyTorch models with differential privacy (DP) that’s more scalable than existing state-of-the-art methods. Differential privacy is a mathematically rigorous framework for quantifying the anonymization of sensitive data. It’s often used in analytics, with growing interest in the machine learning (ML) community. With the release of Opacus, we hope to provide an easier path for researchers and engineers to adopt differential privacy in ML, as well as to accelerate DP research in the field...
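
For readers new to the idea behind the Opacus item above: one widely used recipe for differentially private training is DP-SGD, which clips each example's gradient to a fixed norm and then adds Gaussian noise before averaging. The sketch below illustrates that single step in plain Python. It is not Opacus code and does not use the Opacus API; the function names and the max_norm and noise_multiplier parameters are assumptions chosen for illustration.

```python
import math
import random


def clip_to_norm(grad, max_norm):
    """Scale a gradient vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_mean_gradient(per_sample_grads, max_norm, noise_multiplier, rng):
    """Clip each per-sample gradient, sum them, add Gaussian noise, then average.

    Hypothetical helper for illustration only. The noise standard deviation
    is noise_multiplier * max_norm, so it is tied to the clipping bound.
    """
    clipped = [clip_to_norm(g, max_norm) for g in per_sample_grads]
    dim = len(per_sample_grads[0])
    sigma = noise_multiplier * max_norm
    noisy_sum = [
        sum(g[i] for g in clipped) + rng.gauss(0.0, sigma)
        for i in range(dim)
    ]
    return [s / len(per_sample_grads) for s in noisy_sum]


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.3, 0.4]]  # toy per-sample gradients
    print(private_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng))
```

Clipping bounds how much any one training example can move the model, and the noise scale is set relative to that bound. Turning the noise level into a formal privacy guarantee requires separate privacy accounting, which this sketch leaves out.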



Quick Question For You: Do you want a Data Science job?

After helping hundreds of readers like you get Data Science jobs, we've distilled all the real-world-tested advice into a self-directed course.

The course is broken down into three guides:
  1. Data Science Getting Started Guide. This guide shows you how to figure out the knowledge gaps that MUST be closed in order for you to become a data scientist quickly and effectively (as well as the ones you can ignore).

  2. Data Science Project Portfolio Guide. This guide teaches you how to start, structure, and develop your data science portfolio with the right goals and direction so that you are a hiring manager's dream candidate.

  3. Data Science Resume Guide. This guide shows how to make your resume promote your best parts, what to leave out, how to tailor it to each job you want, as well as how to make your cover letter so good it can't be ignored!

Click here to learn more ...

*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!



Jobs

  • Data Scientist (Entry Level) - Saturn Cloud - Remote

    Saturn Cloud helps companies perform data science at a new level of scale, with one-click solutions, to solve the world’s hardest problems. Our product is a SaaS platform which equips data science teams with high-leverage automation tools, eliminating hours of traditional, manual work. The platform is user-friendly, scalable and secure.

    You will be an entry-level Data Scientist for Saturn Cloud, an exciting new venture founded by the creators of Anaconda, NumPy, and SciPy. The role features drafting the first generation of Saturn resource materials, tutorials, and technical content...

        Want to post a job here? Email us for details >>


Training & Resources

  • Are categorical variables getting lost in your random forests?
    Decision tree models can handle categorical variables without one-hot encoding them. However, popular implementations of decision trees (and random forests) differ as to whether they honor this fact. We show that one-hot encoding can seriously degrade tree-model performance. Our primary comparison is between H2O (which honors categorical variables) and scikit-learn (which requires them to be one-hot encoded)...
  • Getting Started with Machine Learning
    A community-driven place to get started with machine learning and AI. This list is not definite, nor sequential, but we hope it’s a good starting place for anyone looking to get into the field. All resources mentioned in this guide are free, and include a little description of why they are useful. Each section has a set of starting points (usually courses, books, blog posts, etc.), relevant papers and project ideas...
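
On the categorical-variables item above: one commonly cited reason one-hot encoding can hurt tree models is that it shrinks the set of splits a tree can consider at each node. A one-hot column can only separate a single level from all the others, while a tree that handles the categorical directly can group several levels on each side. The sketch below counts both kinds of split for a toy variable. The function names are hypothetical, and it does not reproduce the H2O or scikit-learn implementations compared in the article.

```python
from itertools import combinations


def one_hot(values):
    """Return the sorted levels and one 0/1 row per value."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


def one_hot_splits(levels):
    """Splits available after one-hot encoding: one level versus the rest."""
    return [({level}, set(levels) - {level}) for level in levels]


def native_splits(levels):
    """Every way to divide the levels into two non-empty groups.

    The first level is pinned to the left group so that mirror-image
    partitions are not counted twice.
    """
    levels = sorted(levels)
    first, rest = levels[0], levels[1:]
    splits = []
    for size in range(len(rest) + 1):
        for combo in combinations(rest, size):
            left = {first, *combo}
            right = set(levels) - left
            if right:  # skip the case where every level ends up on the left
                splits.append((left, right))
    return splits


if __name__ == "__main__":
    levels, rows = one_hot(["red", "green", "red", "blue", "yellow"])
    print(levels)
    print(len(one_hot_splits(levels)), "one-hot splits")
    print(len(native_splits(levels)), "native categorical splits")
```

With k levels, the one-hot view offers k single-level splits, while the native view offers 2^(k-1) - 1 groupings, so the gap widens quickly as the number of levels grows.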



Books

  • Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits

    Integrate scikit-learn with various tools such as NumPy, pandas, imbalanced-learn, and scikit-surprise and use it to solve real-world machine learning problems...

    For a detailed list of books covering Data Science, Machine Learning, AI and associated programming languages check out our resources page.

    P.S. Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :)

    All the best, Hannah & Sebastian