# Paper: "A Few Useful Things to Know About Machine Learning" (Translation and Commentary)

Key Insights

Learning = Representation + Evaluation + Optimization

It's Generalization that Counts

Data Alone Is Not Enough

Intuition Fails in High Dimensions

Theoretical Guarantees Are Not What They Seem

Feature Engineering Is The Key

More Data Beats a Cleverer Algorithm

Learn Many Models, Not Just One

Simplicity Does Not Imply Accuracy

Representable Does Not Imply Learnable

Correlation Does Not Imply Causation

Conclusion

# A Few Useful Things to Know About Machine Learning

Machine learning systems automatically learn programs from data. This is often a very attractive alternative to manually constructing them, and in the last decade the use of machine learning has spread rapidly throughout computer science and beyond. Machine learning is used in Web search, spam filters, recommender systems, ad placement, credit scoring, fraud detection, stock trading, drug design, and many other applications. A recent report from the McKinsey Global Institute asserts that machine learning (a.k.a. data mining or predictive analytics) will be the driver of the next big wave of innovation [15]. Several fine textbooks are available to interested practitioners and researchers (for example, Mitchell [16] and Witten et al. [24]). However, much of the "folk knowledge" that is needed to successfully develop machine learning applications is not readily available in them. As a result, many machine learning projects take much longer than necessary or wind up producing less-than-ideal results. Yet much of this folk knowledge is fairly easy to communicate. This is the purpose of this article.

## Key Insights

Machine learning algorithms can figure out how to perform important tasks by generalizing from examples. This is often feasible and cost-effective where manual programming is not. As more data becomes available, more ambitious problems can be tackled.

Machine learning is widely used in computer science and other fields. However, developing successful machine learning applications requires a substantial amount of "black art" that is difficult to find in textbooks.

This article summarizes 12 key lessons that machine learning researchers and practitioners have learned. These include pitfalls to avoid, important issues to focus on, and answers to common questions.

Many different types of machine learning exist, but for illustration purposes I will focus on the most mature and widely used one: classification. Nevertheless, the issues I will discuss apply across all of machine learning. A classifier is a system that inputs (typically) a vector of discrete and/or continuous feature values and outputs a single discrete value, the class. For example, a spam filter classifies email messages into "spam" or "not spam," and its input may be a Boolean vector $x = (x_1, \ldots, x_j, \ldots, x_d)$, where $x_j = 1$ if the $j$th word in the dictionary appears in the email and $x_j = 0$ otherwise. A learner inputs a training set of examples $(x_i, y_i)$, where $x_i = (x_{i,1}, \ldots, x_{i,d})$ is an observed input and $y_i$ is the corresponding output, and outputs a classifier. The test of the learner is whether this classifier produces the correct output $y_t$ for future examples $x_t$ (for example, whether the spam filter correctly classifies previously unseen email messages as spam or not spam).
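The Boolean input vector of the spam-filter example can be sketched in a few lines of Python. This is only an illustration: the function name `to_feature_vector`, the whitespace tokenization, and the four-word dictionary are assumptions made here, not part of the paper.

```python
def to_feature_vector(email_text, dictionary):
    """Return x = (x_1, ..., x_d), where x_j = 1 if the j-th dictionary
    word appears in the email and x_j = 0 otherwise."""
    # Assumption: words are separated by whitespace and case is ignored.
    words = set(email_text.lower().split())
    return [1 if word in words else 0 for word in dictionary]


# Hypothetical toy dictionary; a real one would hold many thousands of words.
dictionary = ["free", "meeting", "money", "report"]
x = to_feature_vector("Claim your FREE money now", dictionary)
print(x)  # [1, 0, 1, 0]
```

A learner would receive many such vectors paired with labels ("spam" or "not spam") and output a classifier that maps a new vector to a label.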

## Learning = Representation + Evaluation + Optimization

Suppose you have an application that you think machine learning might be good for. The first problem facing you is the bewildering variety of learning algorithms available. Which one to use? There are literally thousands available, and hundreds more are published each year. The key to not getting lost in this huge space is to realize that it consists of combinations of just three components. The components are:

Representation. A classifier must be represented in some formal language that the computer can handle. Conversely, choosing a representation for a learner is tantamount to choosing the set of classifiers that it can possibly learn. This set is called the hypothesis space of the learner. If a classifier is not in the hypothesis space, it cannot be learned. A related question, that I address later, is how to represent the input, in other words, what features to use.

Evaluation. An evaluation function (also called objective function or scoring function) is needed to distinguish good classifiers from bad ones. The evaluation function used internally by the algorithm may differ from the external one that we want the classifier to optimize, for ease of optimization and due to the issues I will discuss.

Optimization. Finally, we need a method to search among the classifiers in the language for the highest-scoring one. The choice of optimization technique is key to the efficiency of the learner, and also helps determine the classifier produced if the evaluation function has more than one optimum. It is common for new learners to start out using off-the-shelf optimizers, which are later replaced by custom-designed ones.
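To make the three components concrete, here is a deliberately tiny learner in which each component is a separate function. The hypothesis space (single-feature rules), the use of training accuracy as the evaluation function, and exhaustive search as the optimizer are all choices made for this sketch, not prescriptions from the paper.

```python
from itertools import product


# Representation: the hypothesis space contains only classifiers of the form
# "predict x[j]" or "predict NOT x[j]" for a single feature index j.
def hypothesis_space(num_features):
    return list(product(range(num_features), (False, True)))


def predict(hypothesis, x):
    j, negate = hypothesis
    return 1 - x[j] if negate else x[j]


# Evaluation: fraction of training examples classified correctly.
def accuracy(hypothesis, examples):
    return sum(predict(hypothesis, x) == y for x, y in examples) / len(examples)


# Optimization: exhaustive search for the highest-scoring hypothesis.
def learn(examples):
    num_features = len(examples[0][0])
    return max(hypothesis_space(num_features), key=lambda h: accuracy(h, examples))


# Toy training set in which the class is NOT x[0].
examples = [((1, 0, 1), 0), ((0, 0, 1), 1), ((1, 1, 0), 0), ((0, 1, 0), 1)]
best = learn(examples)
print(best, accuracy(best, examples))  # (0, True) 1.0
```

A class such as `x[0] AND x[1]` is outside this hypothesis space, so this learner could never represent it exactly, however much data it sees.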

The accompanying table shows common examples of each of these three components. For example, k-nearest neighbor classifies a test example by finding the k most similar training examples and predicting the majority class among them. Hyperplane-based methods form a linear combination of the features per class and predict the class with the highest-valued combination. Decision trees test one feature at each internal node, with one branch for each feature value, and have class predictions at the leaves. Algorithm 1 (above) shows a bare-bones decision tree learner for Boolean domains, using information gain and greedy search [20]. InfoGain($x_j$, $y$) is the mutual information between feature $x_j$ and the class $y$. MakeNode($x$, $c_0$, $c_1$) returns a node that tests feature $x$ and has $c_0$ as the child for $x = 0$ and $c_1$ as the child for $x = 1$.

Of course, not all combinations of one component from each column of the table make equal sense. For example, discrete representations naturally go with combinatorial optimization, and continuous ones with continuous optimization. Nevertheless, many learners have both discrete and continuous components, and in fact the day may not be far when every single possible combination has appeared in some learner!

Most textbooks are organized by representation, and it is easy to overlook the fact that the other components are equally important. There is no simple recipe for choosing each component, but I will touch on some of the key issues here. As we will see, some choices in a machine learning project may be even more important than the choice of learner.
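The decision tree learner described above can be sketched in Python as follows. This is a reconstruction from the prose description (information gain, greedy search, Boolean features), not a copy of the paper's Algorithm 1. The names `learn_dt`, `info_gain`, `make_node`, and `classify`, and the dictionary-based node layout, are assumptions of this sketch.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(examples, j):
    """Mutual information between Boolean feature j and the class."""
    gain = entropy([y for _, y in examples])
    for value in (0, 1):
        subset = [y for x, y in examples if x[j] == value]
        if subset:
            gain -= len(subset) / len(examples) * entropy(subset)
    return gain


def make_node(j, c0, c1):
    """Node that tests feature j, with child c0 for x[j] = 0 and c1 for x[j] = 1."""
    return {"feature": j, 0: c0, 1: c1}


def learn_dt(examples):
    labels = [y for _, y in examples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:
        return majority  # leaf: all examples share one class
    num_features = len(examples[0][0])
    best = max(range(num_features), key=lambda j: info_gain(examples, j))
    if info_gain(examples, best) <= 1e-12:
        return majority  # leaf: no feature is informative
    split0 = [(x, y) for x, y in examples if x[best] == 0]
    split1 = [(x, y) for x, y in examples if x[best] == 1]
    return make_node(best, learn_dt(split0), learn_dt(split1))  # greedy recursion


def classify(tree, x):
    while isinstance(tree, dict):
        tree = tree[x[tree["feature"]]]
    return tree


# Toy data: the class is x[0] AND x[1]; x[2] is irrelevant.
examples = [((a, b, c), a & b) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
tree = learn_dt(examples)
print(all(classify(tree, x) == y for x, y in examples))  # True
```

In terms of the three components, the representation is the tree, the evaluation function is information gain, and the optimizer is the greedy top-down recursion.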

## It's Generalization that Counts

The fundamental goal of machine learning is to generalize beyond the examples in the training set. This is because, no matter how much data we have, it is very unlikely that we will see those exact examples again at test time. (Notice that, if there are 100,000 words in the dictionary, the spam filter described above has $2^{100{,}000}$ possible different inputs.) Doing well on the training set is easy (just memorize the examples). The most common mistake among machine learning beginners is to test on the training data and have the illusion of success. If the chosen classifier is then tested on new data, it is often no better than random guessing. So, if you hire someone to build a classifier, be sure to keep some of the data to yourself and test the classifier they give you on it. Conversely, if you have been hired to build a classifier, set some of the data aside from the beginning, and only use it to test your chosen classifier at the very end, followed by learning your final classifier on the whole data.

Contamination of your classifier by test data can occur in insidious ways, for example, if you use test data to tune parameters and do a lot of tuning. (Machine learning algorithms have lots of knobs, and success often comes from twiddling them a lot, so this is a real concern.) Of course, holding out data reduces the amount available for training.
This can be mitigated by doing cross-validation: randomly dividing your training data into (say) 10 subsets, holding out each one while training on the rest, testing each learned classifier on the examples it did not see, and averaging the results to see how well the particular parameter setting does.

In the early days of machine learning, the need to keep training and test data separate was not widely appreciated. This was partly because, if the learner has a very limited representation (for example, hyperplanes), the difference between training and test error may not be large. But with very flexible classifiers (for example, decision trees), or even with linear classifiers with a lot of features, strict separation is mandatory.

Notice that generalization being the goal has an interesting consequence for machine learning. Unlike in most other optimization problems, we do not have access to the function we want to optimize! We have to use training error as a surrogate for test error, and this is fraught with danger. (How to deal with it is addressed later.) On the positive side, since the objective function is only a proxy for the true goal, we may not need to fully optimize it; in fact, a local optimum returned by simple greedy search may be better than the global optimum.
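The gap between training accuracy and held-out accuracy, and the cross-validation procedure described in this section, can be illustrated with a learner that simply memorizes its training examples. Everything here is an assumption made for illustration: the helper names, the synthetic data (unique inputs with purely random labels), and the fixed random seeds.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Randomly partition the indices 0..n-1 into k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(learn, examples, k=10, seed=0):
    """Hold out each fold in turn, train on the rest, and average the accuracies."""
    scores = []
    for held_out in k_fold_indices(len(examples), k, seed):
        held = set(held_out)
        train = [ex for i, ex in enumerate(examples) if i not in held]
        test = [examples[i] for i in held_out]
        classifier = learn(train)
        scores.append(sum(classifier(x) == y for x, y in test) / len(test))
    return sum(scores) / k


def learn_memorizer(train):
    """A learner that memorizes the training set and predicts 0 for unseen inputs."""
    table = {x: y for x, y in train}
    return lambda x: table.get(x, 0)


# Synthetic data: 200 distinct 8-bit inputs with labels that are pure noise.
rng = random.Random(1)
examples = [(tuple(int(b) for b in format(i, "08b")), rng.randint(0, 1))
            for i in range(200)]

memorizer = learn_memorizer(examples)
train_acc = sum(memorizer(x) == y for x, y in examples) / len(examples)
cv_acc = cross_validate(learn_memorizer, examples)
print(train_acc)  # 1.0: testing on the training data gives the illusion of success
print(cv_acc)     # close to 0.5, since the labels carry no signal
```

Because the labels are random, no classifier can do better than chance on unseen inputs, and only the held-out estimate reveals this.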

## Data Alone Is Not Enough

Generalization being the goal has another major consequence: Data alone is not enough, no matter how much of it you have. Consider learning a Boolean function of (say) 100 variables from a million examples. There are $2^{100} - 10^6$ examples whose classes you do not know. How do you figure out what those classes are? In the absence of further information, there is just no way to do this that beats flipping a coin. This observation was first made (in somewhat different form) by the philosopher David Hume over 200 years ago, but even today many mistakes in machine learning stem from failing to appreciate it. Every learner must embody some knowledge or assumptions beyond the data it is given in order to generalize beyond it. This notion was formalized by Wolpert in his famous "no free lunch" theorems, according to which no learner can beat random guessing over all possible functions to be learned [25].

This seems like rather depressing news. How then can we ever hope to learn anything? Luckily, the functions we want to learn in the real world are not drawn uniformly from the set of all mathematically possible functions! In fact, very general assumptions (like smoothness, similar examples having similar classes, limited dependences, or limited complexity) are often enough to do very well, and this is a large part of why machine learning has been so successful. Like deduction, induction (what learners do) is a knowledge lever: it turns a small amount of input knowledge into a large amount of output knowledge. Induction is a vastly more powerful lever than deduction, requiring much less input knowledge to produce useful results, but it still needs more than zero input knowledge to work. And, as with any lever, the more we put in, the more we can get out.
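The scale of the problem in the 100-variable example is easy to check with exact integer arithmetic. The variable names below are chosen for this sketch only.

```python
num_variables = 100
num_examples = 10**6

total_inputs = 2**num_variables        # all possible Boolean input vectors
unseen = total_inputs - num_examples   # inputs whose class is unknown

print(total_inputs)                 # 1267650600228229401496703205376
print(unseen)                       # 1267650600228229401496702205376
print(num_examples / total_inputs)  # fraction of the input space seen, below 1e-24
```

A million examples pin down the function on fewer than one in $10^{24}$ of its inputs. Every other value has to come from assumptions built into the learner.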

A corollary of this is that one of the key criteria for choosing a representation is which kinds of knowledge are easily expressed in it. For example, if we have a lot of knowledge about what makes examples similar in our domain, instance-based methods may be a good choice. If we have knowledge about probabilistic dependencies, graphical models are a good fit. And if we have knowledge about what kinds of preconditions are required by each class, "IF . . . THEN . . ." rules may be the best option. The most useful learners in this regard are those that do not just have assumptions hardwired into them, but allow us to state them explicitly, vary them widely, and incorporate them automatically into the learning (for example, using first-order logic [21] or grammars [6]).

In retrospect, the need for knowledge in learning should not be surprising. Machine learning is not magic; it cannot get something from nothing. What it does is get more from less. Programming, like all engineering, is a lot of work: we have to build everything from scratch. Learning is more like farming, which lets nature do most of the work. Farmers combine seeds with nutrients to grow crops. Learners combine knowledge with data to grow programs.

## Overfitting Has Many Faces

What if the knowledge and data we have are not sufficient to completely determine the correct classifier? Then we run the risk of just hallucinating a classifier (or parts of it) that is not grounded in reality, and is simply encoding random quirks in the data. This problem is called overfitting, and is the bugbear of machine learning. When your learner outputs a classifier that is 100% accurate on the training data but only 50% accurate on test data, when in fact it could have output one that is 75% accurate on both, it has overfit.

Everyone in machine learning knows about overfitting, but it comes in many forms that are not immediately obvious. One way to understand overfitting is by decomposing generalization error into bias and variance [9]. Bias is a learner's tendency to consistently learn the same wrong thing. Variance is the tendency to learn random things irrespective of the real signal. Figure 1 illustrates this by an analogy with throwing darts at a board. A linear learner has high bias, because when the frontier between two classes is not a hyperplane the learner is unable to induce it. Decision trees do not have this problem because they can represent any Boolean function, but on the other hand they can suffer from high variance: decision trees learned on different training sets generated by the same phenomenon are often very different, when in fact they should be the same.
Similar reasoning applies to the choice of optimization method: beam search has lower bias than greedy search, but higher variance, because it tries more hypotheses. Thus, contrary to intuition, a more powerful learner is not necessarily better than a less powerful one.

Figure 2 illustrates this. Even though the true classifier is a set of rules, with up to 1,000 examples naive Bayes is more accurate than a rule learner. This happens despite naive Bayes's false assumption that the frontier is linear! Situations like this are common in machine learning: strong false assumptions can be better than weak true ones, because a learner with the latter needs more data to avoid overfitting.

Cross-validation can help to combat overfitting, for example by using it to choose the best size of decision tree to learn. But it is no panacea, since if we use it to make too many parameter choices it can itself start to overfit [17].

Besides cross-validation, there are many methods to combat overfitting. The most popular one is adding a regularization term to the evaluation function. This can, for example, penalize classifiers with more structure, thereby favoring smaller ones with less room to overfit.
Another option is to perform a statistical significance test like chi-square before adding new structure, to decide whether the distribution of the class really is different with and without this structure. These techniques are particularly useful when data is very scarce. Nevertheless, you should be skeptical of claims that a particular technique "solves" the overfitting problem. It is easy to avoid overfitting (variance) by falling into the opposite error of underfitting (bias). Simultaneously avoiding both requires learning a perfect classifier, and short of knowing it in advance there is no single technique that will always do best (no free lunch).

A common misconception about overfitting is that it is caused by noise, like training examples labeled with the wrong class. This can indeed aggravate overfitting, by making the learner draw a capricious frontier to keep those examples on what it thinks is the right side. But severe overfitting can occur even in the absence of noise. For instance, suppose we learn a Boolean classifier that is just the disjunction of the examples labeled "true" in the training set. (In other words, the classifier is a Boolean formula in disjunctive normal form, where each term is the conjunction of the feature values of one specific training example.) This classifier gets all the training examples right and every positive test example wrong, regardless of whether the training data is noisy or not.

The problem of multiple testing [13] is closely related to overfitting. Standard statistical tests assume that only one hypothesis is being tested, but modern learners can easily test millions before they are done. As a result what looks significant may in fact not be.
For example, a mutual fund that beats the market 10 years in a row looks very impressive, until you realize that, if there are 1,000 funds and each has a 50% chance of beating the market on any given year, it is quite likely that one will succeed all 10 times just by luck. This problem can be combatted by correcting the significance tests to take the number of hypotheses into account, but this can also lead to underfitting. A better approach is to control the fraction of falsely accepted non-null hypotheses, known as the false discovery rate [3].
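The mutual fund example can be checked directly. The sketch below assumes, as the example does, that each fund beats the market independently with probability 0.5 each year; the function name is made up for this illustration.

```python
def prob_some_fund_always_wins(num_funds=1000, years=10, p_beat=0.5):
    """Probability that at least one fund beats the market every single year,
    assuming funds and years are independent."""
    p_single = p_beat ** years             # one given fund wins all years: 1/1024
    return 1 - (1 - p_single) ** num_funds


print(prob_some_fund_always_wins())  # about 0.62
```

So a 10-year winning streak somewhere among 1,000 funds is more likely than not under pure chance, which is why a significance test that ignores the number of hypotheses tried can be badly misleading.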

## Intuition Fails in High Dimensions

After overfitting, the biggest problem in machine learning is the curse of dimensionality. This expression was coined by Bellman in 1961 to refer to the fact that many algorithms that work fine in low dimensions become intractable when the input is high-dimensional. But in machine learning it refers to much more. Generalizing correctly becomes exponentially harder as the dimensionality (number of features) of the examples grows, because a fixed-size training set covers a dwindling fraction of the input space. Even with a moderate dimension of 100 and a huge training set of a trillion examples, the latter covers only a fraction of about $10^{-18}$ of the input space. This is what makes machine learning both necessary and hard.

More seriously, the similarity-based reasoning that machine learning algorithms depend on (explicitly or implicitly) breaks down in high dimensions. Consider a nearest neighbor classifier with Hamming distance as the similarity measure, and suppose the class is just $x_1 \wedge x_2$. If there are no other features, this is an easy problem. But if there are 98 irrelevant features $x_3, \ldots, x_{100}$, the noise from them completely swamps the signal in $x_1$ and $x_2$, and nearest neighbor effectively makes random predictions.

Even more disturbing is that nearest neighbor still has a problem even if all 100 features are relevant! This is because in high dimensions all examples look alike. Suppose, for instance, that examples are laid out on a regular grid, and consider a test example $x_t$. If the grid is $d$-dimensional, $x_t$'s $2d$ nearest examples are all at the same distance from it. So as the dimensionality increases, more and more examples become nearest neighbors of $x_t$, until the choice of nearest neighbor (and therefore of class) is effectively random.
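Two of the numbers in this passage can be reproduced with a short sketch. It assumes Boolean features for the coverage calculation (so the input space has $2^{100}$ points) and a unit-spaced integer grid with Euclidean distance for the neighbor count. Both are assumptions of this illustration, as are the function names.

```python
from itertools import product


def coverage(num_examples, dimension):
    """Fraction of a Boolean input space of the given dimension that
    num_examples distinct examples can cover."""
    return num_examples / 2**dimension


def count_nearest_grid_neighbors(d):
    """Count the grid points at minimal distance from a point of a
    d-dimensional integer grid (checking only the 3^d - 1 adjacent offsets)."""
    offsets = [o for o in product((-1, 0, 1), repeat=d) if any(o)]
    squared = [sum(c * c for c in o) for o in offsets]
    return squared.count(min(squared))


print(coverage(10**12, 100))  # roughly 8e-19, i.e. about 10^-18
print([count_nearest_grid_neighbors(d) for d in range(1, 6)])  # [2, 4, 6, 8, 10]
```

The neighbor count grows as $2d$, so in 100 dimensions a test point on the grid already has 200 equally near neighbors to choose from.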

This is only one instance of a more general problem with high dimensions: our intuitions, which come from a three-dimensional world, often do not apply in high-dimensional ones. In high dimensions, most of the mass of a multivariate Gaussian distribution is not near the mean, but in an increasingly distant "shell" around it; and most of the volume of a high-dimensional orange is in the skin, not the pulp. If a constant number of examples is distributed uniformly in a high-dimensional hypercube, beyond some dimensionality most examples are closer to a face of the hypercube than to their nearest neighbor. And if we approximate a hypersphere by inscribing it in a hypercube, in high dimensions almost all the volume of the hypercube is outside the hypersphere. This is bad news for machine learning, where shapes of one type are often approximated by shapes of another.

Building a classifier in two or three dimensions is easy; we can find a reasonable frontier between examples of different classes just by visual inspection. (It has even been said that if people could see in high dimensions machine learning would not be necessary.) But in high dimensions it is difficult to understand what is happening. This in turn makes it difficult to design a good classifier. Naively, one might think that gathering more features never hurts, since at worst they provide no new information about the class.
But in fact their benefits may be outweighed by the curse of dimensionality.

Fortunately, there is an effect that partly counteracts the curse, which might be called the "blessing of nonuniformity." In most applications examples are not spread uniformly throughout the instance space, but are concentrated on or near a lower-dimensional manifold. For example, k-nearest neighbor works quite well for handwritten digit recognition even though images of digits have one dimension per pixel, because the space of digit images is much smaller than the space of all possible images. Learners can implicitly take advantage of this lower effective dimension, or algorithms for explicitly reducing the dimensionality can be used (for example, Tenenbaum [22]).
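The claim in this section that a hypersphere inscribed in a hypercube occupies almost none of the cube's volume in high dimensions follows from the standard formula for the volume of a $d$-dimensional ball, $V_d(r) = \pi^{d/2} r^d / \Gamma(d/2 + 1)$. The sketch below evaluates it for a ball of radius 1/2 inside the unit cube; the function name is an assumption of this illustration.

```python
import math


def inscribed_sphere_fraction(d):
    """Fraction of the unit hypercube's volume that lies inside the
    inscribed hypersphere of radius 1/2, in d dimensions."""
    return math.pi ** (d / 2) / (math.gamma(d / 2 + 1) * 2**d)


for d in (2, 3, 10, 20):
    print(d, inscribed_sphere_fraction(d))
# d = 2 gives pi/4 (about 0.785) and d = 3 gives pi/6 (about 0.524);
# by d = 10 the fraction is about 0.0025, and by d = 20 it is below 1e-7.
```

The fraction shrinks faster than exponentially, so in even moderately high dimensions nearly all of the cube's volume sits in its corners, outside the sphere.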

## Feature Engineering Is The Key

At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used. Learning is easy if you have many independent features that each correlate well with the class. On the other hand, if the class is a very complex function of the features, you may not be able to learn it. Often, the raw data is not in a form that is amenable to learning, but you can construct features from it that are. This is typically where most of the effort in a machine learning project goes. It is often also one of the most interesting parts, where intuition, creativity and "black art" are as important as the technical stuff.

First-timers are often surprised by how little time in a machine learning project is spent actually doing machine learning. But it makes sense if you consider how time-consuming it is to gather data, integrate it, clean it and preprocess it, and how much trial and error can go into feature design. Also, machine learning is not a one-shot process of building a dataset and running a learner, but rather an iterative process of running the learner, analyzing the results, modifying the data and/or the learner, and repeating. Learning is often the quickest part of this, but that is because we have already mastered it pretty well! Feature engineering is more difficult because it is domain-specific, while learners can be largely general purpose. However, there is no sharp frontier between the two, and this is another reason the most useful learners are those that facilitate incorporating knowledge.

Of course, one of the holy grails of machine learning is to automate more and more of the feature engineering process.
One way this is often done today is by automatically generating large numbers of candidate features and selecting the best by (say) their information gain with respect to the class. But bear in mind that features that look irrelevant in isolation may be relevant in combination. For example, if the class is an XOR of k input features, each of them by itself carries no information about the class. (If you want to annoy machine learners, bring up XOR.) On the other hand, running a learner with a very large number of features to find out which ones are useful in combination may be too time-consuming, or cause overfitting. So there is ultimately no replacement for the smarts you put into feature engineering.
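The XOR remark can be verified by computing information gain over all inputs of a small XOR problem. The sketch below uses k = 3 and treats every input as equally likely. The helper names and the choice to measure gain for sets of features are assumptions of this illustration.

```python
import math
from collections import Counter
from itertools import product


def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def info_gain(rows, labels, feature_indices):
    """Mutual information (in bits) between the class and the joint value
    of the selected features."""
    groups = {}
    for row, y in zip(rows, labels):
        key = tuple(row[j] for j in feature_indices)
        groups.setdefault(key, []).append(y)
    conditional = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


k = 3
rows = list(product((0, 1), repeat=k))    # all 2^k Boolean inputs
labels = [sum(row) % 2 for row in rows]   # class = XOR of the k features

single = [info_gain(rows, labels, [j]) for j in range(k)]
joint = info_gain(rows, labels, list(range(k)))
print(single)  # every individual feature has zero information gain
print(joint)   # all k features together carry the full 1 bit
```

A selection procedure that scores features one at a time would discard all three features here, even though together they determine the class exactly.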

## More Data Beats a Cleverer Algorithm

Suppose you have constructed the best set of features you can, but the classifiers you receive are still not accurate enough. What can you do now? There are two main choices: design a better learning algorithm, or gather more data (more examples, and possibly more raw features, subject to the curse of dimensionality). Machine learning researchers are mainly concerned with the former, but pragmatically the quickest path to success is often to just get more data. As a rule of thumb, a dumb algorithm with lots and lots of data beats a clever one with modest amounts of it. (After all, machine learning is all about letting data do the heavy lifting.)

This does bring up another problem, however: scalability. In most of computer science, the two main limited resources are time and memory. In machine learning, there is a third one: training data. Which one is the bottleneck has changed from decade to decade. In the 1980s it tended to be data. Today it is often time. Enormous mountains of data are available, but there is not enough time to process it, so it goes unused. This leads to a paradox: even though in principle more data means that more complex classifiers can be learned, in practice simpler classifiers wind up being used, because complex ones take too long to learn. Part of the answer is to come up with fast ways to learn complex classifiers, and indeed there has been remarkable progress in this direction (for example, Hulten and Domingos [11]).

Part of the reason using cleverer algorithms has a smaller payoff than you might expect is that, to a first approximation, they all do the same. This is surprising when you consider representations as different as, say, sets of rules and neural networks. But in fact propositional rules are readily encoded as neural networks, and similar relationships hold between other representations.
All learners essentially  work by grouping nearby examples  into the same class; the key difference  is in the meaning of “nearby.”  With nonuniformly distributed data,  learners can produce widely different  frontiers while still making the same  predictions in the regions that matter  (those with a substantial number of  training examples, and therefore also  where most test examples are likely to  appear). This also helps explain why  powerful learners can be unstable but  still accurate. Figure 3 illustrates this  in 2D; the effect is much stronger in  high dimensions. 假设您已经构建了最好的功能集，但是收到的分类器仍然不够准确。你现在可以做什么？有两个主要选择：设计更好的学习算法，或收集更多数据（更多示例，可能还有更多原始特征，这取决于维度的诅咒）。机器学习研究人员主要与前者有关，但务实地，成功的最快途径通常是获取更多数据。根据经验，具有大量数据的愚蠢算法要击败数量适中的聪明算法。 （毕竟，机器学习只不过是让数据繁重而已。）但这确实带来了另一个问题：可伸缩性。在大多数计算机科学中，两个主要的有限资源是时间和内存。在机器学习中，有三分之一是训练数据。哪个瓶颈已经从十年改变到了十年。在1980年代，它倾向于成为数据。今天是时候了。有大量的数据可用，但是没有足够的时间来处理它，因此它没有被使用。这导致了一个悖论：尽管在原则上更多的数据意味着可以学习更多的复杂分类器，但在实践中却使用了更简单的分类器，因为复杂的分类器学习时间太长。答案的一部分是想出一种快速的方法来学习复杂的分类器，实际上在这个方向上已经取得了显着的进展（例如Hulten和Domingos11）。 使用更聪明算法的部分原因是收益比您预期的要小，这是一个近似值。当您认为表示形式与规则集和神经网络不同时，这令人惊讶。但是实际上命题规则很容易被编码为神经网络，并且其他表示之间也存在类似的关系。本质上，所有学习者都通过将附近的示例分组到同一个班级中来工作;关键区别在于“附近”的含义。使用非均匀分布的数据，学习者可以产生非常不同的边界，同时仍可以在重要区域做出相同的预测（那些区域具有大量的训练示例，因此也有可能出现大多数测试示例）。这也有助于解释为什么强大的学习者可能不稳定但仍然准确。图3以2D形式说明了这一点;在高尺寸时效果更强。 As a rule, it pays to try the simplest  learners first (for example, naïve Bayes  before logistic regression, k-nearest  neighbor before support vector machines).  More sophisticated learn ers are seductive, but they are usually  harder to use, because they have more  knobs you need to turn to get good results,  and because their internals are  more opaque.   Learners can be divided into two  major types: those whose representation  has a fixed size, like linear classifiers,  and those whose representation  can grow with the data, like decision  trees. 
(The latter are sometimes called  nonparametric learners, but this is  somewhat unfortunate, since they  usually wind up learning many more  parameters than parametric ones.)  Fixed-size learners can only take advantage  of so much data. (Notice how  the accuracy of naive Bayes asymptotes  at around 70% in Figure 2.) Variablesize  learners can in principle learn any  function given sufficient data, but in  practice they may not, because of limitations  of the algorithm (for example,  greedy search falls into local optima)  or computational cost. Also, because  of the curse of dimensionality, no existing  amount of data may be enough.  For these reasons, clever algorithms—  those that make the most of the data  and computing resources available—  often pay off in the end, provided you  are willing to put in the effort. There  is no sharp frontier between designing  learners and learning classifiers;  rather, any given piece of knowledge  could be encoded in the learner or  learned from data. So machine learning  projects often wind up having a  significant component of learner design,  and practitioners need to have  some expertise in it.12 In the end, the biggest bottleneck is not data or CPU cycles, but human cycles. In research papers, learners  are typically compared on measures  of accuracy and computational cost.  But human effort saved and insight  gained, although harder to measure,  are often more important. This favors  learners that produce human-understandable  output (for example, rule  sets). And the organizations that make  the most of machine learning are  those that have in place an infrastructure  that makes experimenting with  many different learners, data sources,  and learning problems easy and efficient,  and where there is a close collaboration  between machine learning  experts and application domain ones. 
通常，首先尝试最简单的学习者是值得的（例如，逻辑回归之前的朴素贝叶斯，支持向量机之前的k近邻）。经验丰富的学习者很诱人，但它们通常更难使用，因为它们具有更多的旋钮，您需要转向以获得良好的效果，并且它们的内部更加不透明。 学习者可以分为两种主要类型：那些具有固定大小的表示形式（如线性分类器）和那些随着数据增长的表示形式（如决策树）。 （后者有时被称为非参数学习者，但这有点不幸，因为他们通常要比参数学习更多的参数。）固定大小的学习者只能利用这么多数据。 （请注意，图2中朴素的贝叶斯渐近线的准确度约为70％）。可变大小的学习者原则上可以在给定足够数据的情况下学习任何函数，但由于算法的限制，实际上它们可能无法学习任何函数（例如，贪婪搜索下降转化为局部最优值）或计算成本。同样，由于维数的诅咒，现有的数据量可能不足。由于这些原因，只要您愿意付出努力，聪明的算法-那些可以充分利用可用数据和计算资源的算法通常会最终获得回报。设计学习者和学习分类器之间没有前沿的界限;相反，任何给定的知识都可以在学习者中进行编码或从数据中学习。因此，机器学习项目通常会包含学习者设计的重要组成部分，并且从业者需要在其中拥有一些专业知识.12 最后，最大的瓶颈不是数据或CPU周期，而是人员周期。在研究论文中，通常会比较学习者的准确性和计算成本。但是，尽管难以衡量，但节省了人力并获得见识通常更重要。这有利于产生人类可理解的输出（例如规则集）的学习者。充分利用机器学习的组织是那些拥有适当基础设施的组织，这些基础设施使对许多不同的学习者，数据源和学习问题的实验变得容易而高效，并且机器学习专家和应用程序领域之间存在密切的协作那些。
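The contrast between fixed-size and variable-size learners can be seen on a toy problem. The sketch below rests on assumptions that are not in the paper: a one-dimensional concept made of eight alternating stripes, a single-threshold "stump" standing in for a fixed-size learner, and 1-nearest-neighbor standing in for a variable-size learner. It only illustrates the plateau effect and does not reproduce Figure 2.

```python
import bisect
import random

def target(x):
    """Toy concept (an assumption of this sketch): eight alternating stripes on [0, 1)."""
    return int(x * 8) % 2

def sample(n, rng):
    return [(x, target(x)) for x in (rng.random() for _ in range(n))]

def fit_stump(train):
    """Fixed-size learner: one threshold plus a polarity, however much data arrives."""
    best = None
    for t in (i / 50 for i in range(51)):
        for polarity in (0, 1):
            correct = sum((polarity if x >= t else 1 - polarity) == y for x, y in train)
            if best is None or correct > best[0]:
                best = (correct, t, polarity)
    _, t, polarity = best
    return lambda x: polarity if x >= t else 1 - polarity

def fit_1nn(train):
    """Variable-size learner: 1-nearest-neighbor keeps every training example."""
    pts = sorted(train)
    xs = [x for x, _ in pts]

    def predict(x):
        i = bisect.bisect_left(xs, x)
        neighbors = pts[max(i - 1, 0):i + 1]  # closest stored point on each side
        return min(neighbors, key=lambda p: abs(p[0] - x))[1]

    return predict

def accuracy(predict, test):
    return sum(predict(x) == y for x, y in test) / len(test)

def learning_curve(sizes, seed=0):
    """Return (n, stump accuracy, 1-NN accuracy) for each training-set size."""
    rng = random.Random(seed)
    test = sample(2000, rng)
    rows = []
    for n in sizes:
        train = sample(n, rng)
        rows.append((n, accuracy(fit_stump(train), test), accuracy(fit_1nn(train), test)))
    return rows

for n, stump_acc, knn_acc in learning_curve([20, 200, 2000]):
    print(f"n={n:5d}  stump={stump_acc:.3f}  1-NN={knn_acc:.3f}")
```

On this concept no single threshold can get more than about five of the eight stripes right, so extra data cannot help the stump, while the nearest-neighbor learner keeps improving as examples accumulate.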

## Learn Many Models, Not Just One

In the early days of machine learning, everyone had a favorite learner, together with some a priori reasons to believe in its superiority. Most effort went into trying many variations of it and selecting the best one. Then systematic empirical comparisons showed that the best learner varies from application to application, and systems containing many different learners started to appear. Effort now went into trying many variations of many learners, and still selecting just the best one. But then researchers noticed that, if instead of selecting the best variation found, we combine many variations, the results are better (often much better) and at little extra effort for the user.

Creating such model ensembles is now standard [1]. In the simplest technique, called bagging, we simply generate random variations of the training set by resampling, learn a classifier on each, and combine the results by voting. This works because it greatly reduces variance while only slightly increasing bias. In boosting, training examples have weights, and these are varied so that each new classifier focuses on the examples the previous ones tended to get wrong. In stacking, the outputs of individual classifiers become the inputs of a “higher-level” learner that figures out how best to combine them.

Many other techniques exist, and the trend is toward larger and larger ensembles. In the Netflix prize, teams from all over the world competed to build the best video recommender system (http://netflixprize.com). As the competition progressed, teams found they obtained the best results by combining their learners with other teams’, and merged into larger and larger teams. The winner and runner-up were both stacked ensembles of over 100 learners, and combining the two ensembles further improved the results. Doubtless we will see even larger ones in the future.

Model ensembles should not be confused with Bayesian model averaging (BMA), the theoretically optimal approach to learning [4]. In BMA, predictions on new examples are made by averaging the individual predictions of all classifiers in the hypothesis space, weighted by how well the classifiers explain the training data and how much we believe in them a priori. Despite their superficial similarities, ensembles and BMA are very different. Ensembles change the hypothesis space (for example, from single decision trees to linear combinations of them), and can take a wide variety of forms. BMA assigns weights to the hypotheses in the original space according to a fixed formula. BMA weights are extremely different from those produced by (say) bagging or boosting: the latter are fairly even, while the former are extremely skewed, to the point where the single highest-weight classifier usually dominates, making BMA effectively equivalent to just selecting it [8]. A practical consequence of this is that, while model ensembles are a key part of the machine learning toolkit, BMA is seldom worth the trouble.
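The bagging recipe described in this section (resample the training set, learn a classifier on each resample, vote) can be sketched in a few lines. This is a minimal illustration under assumptions of my own: the base learner is a one-feature threshold rule, and the toy data set, `fit_stump`, and `bagging` are hypothetical names and choices rather than anything specified by the paper.

```python
import random
from collections import Counter

def fit_stump(data):
    """Learn the single-feature threshold rule with the fewest training errors."""
    best = None
    for f in range(len(data[0][0])):
        for x, _ in data:
            for polarity in (0, 1):
                # Bind the current feature, threshold and polarity into the rule.
                def rule(v, f=f, t=x[f], polarity=polarity):
                    return polarity if v[f] >= t else 1 - polarity
                errors = sum(rule(v) != y for v, y in data)
                if best is None or errors < best[0]:
                    best = (errors, rule)
    return best[1]

def bagging(data, n_models, rng):
    """Train one stump per bootstrap resample and combine them by majority vote."""
    models = []
    for _ in range(n_models):
        resample = [rng.choice(data) for _ in data]  # sample with replacement
        models.append(fit_stump(resample))

    def predict(v):
        votes = Counter(model(v) for model in models)
        return votes.most_common(1)[0][0]

    return predict

# Hypothetical toy data: two features, label depends only on the first one.
rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(60)]
data = [(p, int(p[0] > 0.5)) for p in points]

ensemble = bagging(data, n_models=15, rng=rng)
train_accuracy = sum(ensemble(p) == y for p, y in data) / len(data)
print(f"training accuracy of the bagged ensemble: {train_accuracy:.3f}")
```

An odd number of models avoids tied votes in the two-class case. Boosting and stacking would replace the uniform resampling and the plain vote respectively; they are not shown here.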

## Simplicity Does Not Imply Accuracy

Occam’s razor famously states that entities should not be multiplied beyond necessity. In machine learning, this is often taken to mean that, given two classifiers with the same training error, the simpler of the two will likely have the lower test error. Purported proofs of this claim appear regularly in the literature, but in fact there are many counterexamples to it, and the “no free lunch” theorems imply it cannot be true.

We saw one counterexample previously: model ensembles. The generalization error of a boosted ensemble continues to improve by adding classifiers even after the training error has reached zero. Another counterexample is support vector machines, which can effectively have an infinite number of parameters without overfitting. Conversely, the function sign(sin(ax)) can discriminate an arbitrarily large, arbitrarily labeled set of points on the x axis, even though it has only one parameter [23]. Thus, contrary to intuition, there is no necessary connection between the number of parameters of a model and its tendency to overfit.

A more sophisticated view instead equates complexity with the size of the hypothesis space, on the basis that smaller spaces allow hypotheses to be represented by shorter codes. Bounds like the one in the section on theoretical guarantees might then be viewed as implying that shorter hypotheses generalize better. This can be further refined by assigning shorter codes to the hypotheses in the space we have some a priori preference for. But viewing this as “proof” of a trade-off between accuracy and simplicity is circular reasoning: we made the hypotheses we prefer simpler by design, and if they are accurate it is because our preferences are accurate, not because the hypotheses are “simple” in the representation we chose.

A further complication arises from the fact that few learners search their hypothesis space exhaustively. A learner with a larger hypothesis space that tries fewer hypotheses from it is less likely to overfit than one that tries more hypotheses from a smaller space. As Pearl [18] points out, the size of the hypothesis space is only a rough guide to what really matters for relating training and test error: the procedure by which a hypothesis is chosen.

Domingos [7] surveys the main arguments and evidence on the issue of Occam’s razor in machine learning. The conclusion is that simpler hypotheses should be preferred because simplicity is a virtue in its own right, not because of a hypothetical connection with accuracy. This is probably what Occam meant in the first place.
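The claim about sign(sin(ax)) can be made concrete with a well-known construction. The section itself does not spell it out, so the choice of points x_i = 10^-i and the formula for a below are assumptions of this sketch. For labels y_i in {+1, -1}, setting a = pi * (1 + sum over i of ((1 - y_i) / 2) * 10^i) reproduces every labeling of those points with a single parameter.

```python
import itertools
import math

def fit_a(labels):
    """Choose a so that sign(sin(a * 10**-(i + 1))) equals labels[i] (each +1 or -1)."""
    return math.pi * (1 + sum((1 - y) // 2 * 10 ** (i + 1) for i, y in enumerate(labels)))

def predict(a, x):
    return 1 if math.sin(a * x) > 0 else -1

n = 6  # kept small so double-precision arithmetic stays well within its margins
points = [10.0 ** -(i + 1) for i in range(n)]

# Try every one of the 2**n possible labelings of the n points.
all_fit = all(
    [predict(fit_a(labels), x) for x in points] == list(labels)
    for labels in itertools.product([-1, 1], repeat=n)
)
print(f"all {2 ** n} labelings of {n} points reproduced with one parameter: {all_fit}")
```

The one-parameter model memorizes any labeling of these points, which is why counting parameters says little about the capacity to overfit.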

## Representable Does Not Imply Learnable

Essentially all representations used in variable-size learners have associated theorems of the form “Every function can be represented, or approximated arbitrarily closely, using this representation.” Reassured by this, fans of the representation often proceed to ignore all others. However, just because a function can be represented does not mean it can be learned. For example, standard decision tree learners cannot learn trees with more leaves than there are training examples. In continuous spaces, representing even simple functions using a fixed set of primitives often requires an infinite number of components. Further, if the hypothesis space has many local optima of the evaluation function, as is often the case, the learner may not find the true function even if it is representable. Given finite data, time and memory, standard learners can learn only a tiny subset of all possible functions, and these subsets are different for learners with different representations. Therefore the key question is not “Can it be represented?” to which the answer is often trivial, but “Can it be learned?” And it pays to try different learners (and possibly combine them).

Some representations are exponentially more compact than others for some functions. As a result, they may also require exponentially less data to learn those functions. Many learners work by forming linear combinations of simple basis functions. For example, support vector machines form combinations of kernels centered at some of the training examples (the support vectors). Representing parity of n bits in this way requires 2^n basis functions. But using a representation with more layers (that is, more steps between input and output), parity can be encoded in a linear-size classifier. Finding methods to learn these deeper representations is one of the major research frontiers in machine learning [2].
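The gap between shallow and deep encodings of parity can be illustrated with a small sketch. It does not use kernel basis functions as the text does; instead, as a stated assumption, the shallow form is a flat OR-of-ANDs with one exact-match term per odd-parity input (2^(n-1) terms, still exponential in n), and the deep form is a chain of two-input XOR steps. All function names are hypothetical.

```python
from functools import reduce
from itertools import product
from operator import xor

def parity_deep(bits):
    """Chain of len(bits) - 1 two-input XOR steps: size is linear in n."""
    return reduce(xor, bits)

def shallow_terms(n):
    """Flat OR-of-ANDs form: one exact-match term per odd-parity input."""
    return {bits for bits in product((0, 1), repeat=n) if sum(bits) % 2 == 1}

def parity_shallow(bits, terms):
    """Output 1 if the input matches any stored term, else 0."""
    return int(tuple(bits) in terms)

n = 10
terms = shallow_terms(n)
agree = all(parity_deep(b) == parity_shallow(b, terms) for b in product((0, 1), repeat=n))
print(f"n={n}: shallow form uses {len(terms)} terms, "
      f"deep chain uses {n - 1} XOR steps, agree={agree}")
```

Both encodings compute the same function; only the deep one stays small as n grows, which is the sense in which some representations are exponentially more compact than others.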

## Correlation Does Not Imply Causation

The point that correlation does not imply causation is made so often that it is perhaps not worth belaboring. But, even though learners of the kind we have been discussing can only learn correlations, their results are often treated as representing causal relations. Isn’t this wrong? If so, then why do people do it?

More often than not, the goal of learning predictive models is to use them as guides to action. If we find that beer and diapers are often bought together at the supermarket, then perhaps putting beer next to the diaper section will increase sales. (This is a famous example in the world of data mining.) But short of actually doing the experiment it is difficult to tell. Machine learning is usually applied to observational data, where the predictive variables are not under the control of the learner, as opposed to experimental data, where they are. Some learning algorithms can potentially extract causal information from observational data, but their applicability is rather restricted [19]. On the other hand, correlation is a sign of a potential causal connection, and we can use it as a guide to further investigation (for example, trying to understand what the causal chain might be).

Many researchers believe that causality is only a convenient fiction. For example, there is no notion of causality in physical laws. Whether or not causality really exists is a deep philosophical question with no definitive answer in sight, but there are two practical points for machine learners. First, whether or not we call them “causal,” we would like to predict the effects of our actions, not just correlations between observable variables. Second, if you can obtain experimental data (for example by randomly assigning visitors to different versions of a Web site), then by all means do so [14].
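The difference between observational and experimental data can be shown with a toy simulation. Everything in it is an assumption made for illustration: a hidden "busy day" factor, the probabilities, and the sales figures are invented, and the true effect of shelf placement on sales is deliberately set to zero.

```python
import random

def simulate(n, randomized, rng):
    """Return (placed, sales) pairs for n store-days in a made-up world.

    A hidden 'busy day' factor raises sales. In observational data it also makes
    the manager more likely to place beer next to diapers; in the randomized
    setting placement is a coin flip. Placement itself has no effect on sales.
    """
    rows = []
    for _ in range(n):
        busy = rng.random() < 0.5
        if randomized:
            placed = rng.random() < 0.5
        else:
            placed = rng.random() < (0.8 if busy else 0.2)
        sales = (10.0 if busy else 5.0) + rng.gauss(0, 1)
        rows.append((placed, sales))
    return rows

def mean_difference(rows):
    """Average sales with the placement minus average sales without it."""
    treated = [s for placed, s in rows if placed]
    control = [s for placed, s in rows if not placed]
    return sum(treated) / len(treated) - sum(control) / len(control)

rng = random.Random(0)
observational = mean_difference(simulate(20000, randomized=False, rng=rng))
experimental = mean_difference(simulate(20000, randomized=True, rng=rng))
print(f"observational difference: {observational:.2f}")
print(f"randomized difference:    {experimental:.2f}")
```

In the observational data the placement looks strongly beneficial because busy days drive both placement and sales; random assignment breaks that link and the apparent effect disappears.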

## Conclusion

Like any discipline, machine learning has a lot of “folk wisdom” that can be difficult to come by, but is crucial for success. This article summarized some of the most salient items. Of course, it is only a complement to the more conventional study of machine learning. Check out http://www.cs.washington.edu/homes/pedrod/class for a complete online machine learning course that combines formal and informal aspects. There is also a treasure trove of machine learning lectures at http://www.videolectures.net. A good open source machine learning toolkit is Weka [24]. Happy learning!
