CS229 Lecture Notes
Andrew Ng (updated by Tengyu Ma, 2019)

Course information: lectures are Monday and Wednesday, 4:30pm-5:50pm, with links to the lectures on Canvas. The current quarter's class videos are available online for both SCPD and non-SCPD students, organized into weeks of roughly ten short (about 10 minute) videos each, alongside the problem sets, quizzes, and slides from Andrew's lecture on getting machine learning algorithms to work in practice ("Advice on applying machine learning"). The schedule and deadlines are subject to change as each offering is updated, so make sure you stay up to date so as not to lose the pace of the class. An adapted version of the course is also offered through the Stanford Artificial Intelligence Professional Program, featuring classroom lecture videos edited and segmented to focus on essential content, coding assignments with added inline support and milestone code checks, office hours and support from Stanford-affiliated course assistants, and a cohort group connected via a Slack community for collaborating with other learners.

Supervised learning

Let's start by talking about a few examples of supervised learning problems. Suppose we have a dataset giving the living areas and prices of 47 houses from Portland, Oregon:

    Living area (feet^2)    Price (1000$s)
    2104                    400
    1600                    330
    2400                    369
    1416                    232
    3000                    540
    ...                     ...

We can plot this data as living area against price. Given data like this, how can we learn to predict the prices of other houses in Portland, as a function of the size of their living areas?
To establish notation for future use, we'll use x(i) to denote the "input" variables (living area in this example), also called input features, and y(i) to denote the "output" or target variable that we are trying to predict (price). A pair (x(i), y(i)) is called a training example, and the dataset that we'll be using to learn, a list of n training examples {(x(i), y(i)); i = 1, ..., n}, is called a training set. Note that the superscript "(i)" in this notation is simply an index into the training set, and has nothing to do with exponentiation. We will also use X to denote the space of input values, and Y the space of output values. In this example, X = Y = R.

The goal of supervised learning is, given a training set, to learn a function h : X -> Y so that h(x) is a "good" predictor for the corresponding value of y. For historical reasons, this function h is called a hypothesis. When the target variable that we're trying to predict is continuous, such as in our housing example, we call the learning problem a regression problem. When y can take on only a small number of discrete values (such as if, given the living area, we wanted to predict whether a dwelling is a house or an apartment, say), we call it a classification problem.
Part I: Linear regression

To make our housing example more interesting, suppose the dataset also recorded the number of bedrooms in each house, so that each input x(i) is a two-dimensional vector: x1(i) is the living area of the i-th house in the training set, and x2(i) is its number of bedrooms. (In general, it is up to you to decide which features to use, and the choice of features is important to ensuring good performance of a learning algorithm; we'll also see algorithms for automatically choosing a good set of features.)

To perform supervised learning, we must decide how we're going to represent functions/hypotheses h. As an initial choice, let's say we decide to approximate y as a linear function of x:

    hθ(x) = θ0 + θ1 x1 + θ2 x2.

Here, the θj's are the parameters (also called weights) parameterizing the space of linear functions mapping from X to Y. With the convention of an intercept term x0 = 1, we can write this compactly as hθ(x) = θᵀx.

Now, given a training set, how do we pick, or learn, the parameters θ? One reasonable method seems to be to make h(x) close to y, at least for the training examples we have. To formalize this, we will define a function that measures, for each value of the θ's, how close the hθ(x(i))'s are to the corresponding y(i)'s. We define the cost function

    J(θ) = (1/2) Σi (hθ(x(i)) − y(i))².

If you've seen linear regression before, you may recognize this as the familiar least-squares cost function that gives rise to the ordinary least squares regression model. Whether or not you have seen it previously, let's keep going, and we'll eventually show this to be a special case of a much broader family of algorithms.
The LMS algorithm

We want to choose θ so as to minimize J(θ). To do so, let's use a search algorithm that starts with some "initial guess" for θ, and that repeatedly changes θ to make J(θ) smaller, until hopefully we converge to a value of θ that minimizes J(θ). Specifically, let's consider the gradient descent algorithm, which starts with some initial θ, and repeatedly performs the update

    θj := θj − α ∂J(θ)/∂θj.

(This update is simultaneously performed for all values of j = 0, ..., d.) Here, α is called the learning rate. This is a very natural algorithm that repeatedly takes a step in the direction of steepest decrease of J. (We use the notation "a := b" to denote an operation, in a computer program, in which we set the value of a variable a to be equal to the value of b; in other words, this operation overwrites a with the value of b. In contrast, we will write "a = b" when we are asserting a statement of fact, that the value of a is equal to the value of b.)

In order to implement this algorithm, we have to work out what the partial derivative term on the right hand side is. Let's first work it out for the case of having only one training example (x, y), so that we can neglect the sum in the definition of J. For a single training example, this gives the update rule

    θj := θj + α (y(i) − hθ(x(i))) xj(i).

The rule is called the LMS update rule (LMS stands for "least mean squares"), and is also known as the Widrow-Hoff learning rule. This rule has several properties that seem natural and intuitive. For instance, the magnitude of the update is proportional to the error term (y(i) − hθ(x(i))): if we are encountering a training example on which our prediction nearly matches the actual value of y(i), then we find that there is little need to change the parameters; in contrast, a larger change to the parameters will be made if our prediction hθ(x(i)) has a large error (i.e., if it is very far from y(i)).

We'd derived the LMS rule for when there was only a single training example. There are two ways to modify this method for a training set of more than one example. The first is to replace it with the following algorithm: repeat until convergence, updating, for every j,

    θj := θj + α Σi (y(i) − hθ(x(i))) xj(i).

By grouping the updates of the coordinates into an update of the vector θ, we can rewrite this in a slightly more succinct way. The reader can easily verify that the quantity in the summation above is just ∂J(θ)/∂θj (for the original definition of J), so this is simply gradient descent on the original cost function J. This method looks at every example in the entire training set on every step, and is called batch gradient descent. Note that, while gradient descent can be susceptible to local minima in general, the optimization problem we have posed here has only one global optimum and no other local optima, because J is a convex quadratic function; gradient descent (with a suitable learning rate) therefore converges to the global minimum. Running batch gradient descent to fit θ on our housing dataset gives a straight-line fit to the data; if the number of bedrooms were included as one of the input features as well, we get θ0 = 89.60, θ1 = 0.1392, θ2 = −8.738.
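To make the batch gradient descent update concrete, here is a minimal NumPy sketch run on the five rows of the housing table above; the feature rescaling, learning rate, and iteration count are illustrative choices rather than values from the notes.

```python
import numpy as np

# Tiny dataset from the table above: living area (ft^2) -> price (1000$s).
living_area = np.array([2104., 1600., 2400., 1416., 3000.])
y = np.array([400., 330., 369., 232., 540.])

# Add the intercept term x0 = 1, and rescale the living area so that a single
# learning rate works reasonably for both coordinates (illustrative choice).
X = np.column_stack([np.ones_like(living_area), living_area / 1000.0])

def h(theta, X):
    """Linear hypothesis h_theta(x) = theta^T x, applied to every row of X."""
    return X @ theta

def J(theta, X, y):
    """Least-squares cost J(theta) = 1/2 * sum_i (h_theta(x(i)) - y(i))^2."""
    r = h(theta, X) - y
    return 0.5 * r @ r

theta = np.zeros(X.shape[1])
alpha = 0.01  # learning rate (illustrative)
for _ in range(10_000):
    # Batch update: theta_j := theta_j + alpha * sum_i (y(i) - h_theta(x(i))) * x_j(i)
    theta += alpha * (y - h(theta, X)) @ X

print(theta, J(theta, X, y))
```

Because the living area is divided by 1000 inside the sketch, the slope it prints is in thousands of dollars per 1000 square feet; undoing the rescaling recovers a slope in the original units.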
There is an alternative to batch gradient descent that also works very well. Consider the following algorithm: repeatedly run through the training set, and each time we encounter a training example, update the parameters according to the gradient of the error with respect to that single training example only:

    θj := θj + α (y(i) − hθ(x(i))) xj(i)    (for every j).

This algorithm is called stochastic gradient descent (also incremental gradient descent). Whereas batch gradient descent has to scan through the entire training set before taking a single step (a costly operation if n is large), stochastic gradient descent can start making progress right away, and continues to make progress with each example it looks at. Often, stochastic gradient descent gets θ "close" to the minimum much faster than batch gradient descent. (Note however that it may never "converge" to the minimum, and the parameters θ will keep oscillating around the minimum of J(θ); but in practice most of the values near the minimum will be reasonably good approximations to the true minimum.) By slowly letting the learning rate α decrease to zero as the algorithm runs, it is also possible to ensure that the parameters will converge to the global minimum rather than merely oscillate around it. For these reasons, particularly when the training set is large, stochastic gradient descent is often preferred over batch gradient descent.
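A corresponding stochastic gradient descent sketch on the same toy data; visiting the examples in random order and slowly decaying the learning rate are illustrative choices in the spirit of the discussion above, not prescriptions from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(5),
                     np.array([2104., 1600., 2400., 1416., 3000.]) / 1000.0])
y = np.array([400., 330., 369., 232., 540.])

theta, alpha = np.zeros(2), 0.02
for epoch in range(2_000):
    for i in rng.permutation(len(y)):      # one pass through the training set
        err = y[i] - X[i] @ theta          # y(i) - h_theta(x(i)) for this example
        theta += alpha * err * X[i]        # update using this single example only
    alpha *= 0.999                         # slowly decrease the learning rate

print(theta)
```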
The normal equations

Gradient descent gives one way of minimizing J. Let's discuss a second way of doing so, this time performing the minimization explicitly and without resorting to an iterative algorithm. In this method, we will minimize J by explicitly taking its derivatives with respect to the θj's and setting them to zero. To enable us to do this without having to write reams of algebra and pages full of matrices of derivatives, let's introduce some notation for doing calculus with matrices.

For a function f : R^(n×d) -> R mapping from n-by-d matrices to the real numbers, we define the derivative of f with respect to A to be the matrix of partial derivatives ∂f/∂Aij. Thus, the gradient ∇A f(A) is itself an n-by-d matrix whose (i, j)-element is ∂f/∂Aij. (Here, Aij denotes the (i, j) entry of the matrix A.)

Given a training set, define the design matrix X to be the n-by-d matrix (actually n-by-d+1, if we include the intercept term) that contains the training examples' input values in its rows, and let ~y be the n-dimensional vector of target values y(i). We then have J(θ) = (1/2)(Xθ − ~y)ᵀ(Xθ − ~y). Armed with the tools of matrix derivatives, we can now find in closed form the value of θ that minimizes J(θ); the derivation uses the fact that aᵀb = bᵀa, and the facts ∇x bᵀx = b and ∇x xᵀAx = 2Ax for a symmetric matrix A (for more details, see Section 4.3 of "Linear Algebra Review and Reference"). To minimize J, we set its derivatives to zero, and obtain the normal equations:

    XᵀXθ = Xᵀ~y.

Thus, the value of θ that minimizes J(θ) is given in closed form by

    θ = (XᵀX)⁻¹Xᵀ~y.

Note that in this step we are implicitly assuming that XᵀX is an invertible matrix; this can be checked before calculating the inverse. If either the number of linearly independent examples is fewer than the number of features, or if the features are not linearly independent, then XᵀX will not be invertible. Even in such cases, it is possible to "fix" the situation with additional techniques, which we skip here for the sake of simplicity.
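As a check on the closed-form solution, the normal equations can be solved directly on the same toy dataset; using a linear solve instead of forming the explicit inverse is a standard numerical choice, not something prescribed by the notes.

```python
import numpy as np

X = np.column_stack([np.ones(5),
                     np.array([2104., 1600., 2400., 1416., 3000.])])
y = np.array([400., 330., 369., 232., 540.])

# Solve X^T X theta = X^T y rather than computing (X^T X)^{-1} explicitly.
theta = np.linalg.solve(X.T @ X, X.T @ y)

print(theta)           # intercept and slope of the least-squares fit
print(X @ theta - y)   # residuals on the five training examples
```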

Probabilistic interpretation

When faced with a regression problem, why might linear regression, and specifically why might the least-squares cost function J, be a reasonable choice? In this section, we will give a set of probabilistic assumptions under which least-squares regression is derived as a very natural algorithm.

Let us assume that the target variables and the inputs are related via the equation

    y(i) = θᵀx(i) + ε(i),

where ε(i) is an error term that captures either unmodeled effects (such as if there are some features very pertinent to predicting housing price that we'd left out of the regression) or random noise. Let us further assume that the ε(i) are distributed IID (independently and identically distributed) according to a Gaussian distribution with mean zero and some variance σ². We can write this assumption as "ε(i) ∼ N(0, σ²)." This implies that the distribution of y(i) given x(i) is y(i) | x(i); θ ∼ N(θᵀx(i), σ²). The notation "p(y(i)|x(i); θ)" indicates that this is the distribution of y(i) for a fixed value of θ; note that we should not condition on θ (as in "p(y(i)|x(i), θ)"), since θ is not a random variable.

Given X (the design matrix, which contains all the x(i)'s) and θ, what is the distribution of the y(i)'s? The probability of the data is given by p(~y|X; θ). This quantity is typically viewed as a function of ~y (and perhaps X), for a fixed value of θ. When we wish to explicitly view it as a function of θ, we will instead call it the likelihood function L(θ). Note that by the independence assumption on the ε(i)'s (and hence also on the y(i)'s given the x(i)'s), L(θ) can be written as the product of the per-example densities p(y(i)|x(i); θ).

Now, given this probabilistic model relating the y(i)'s and the x(i)'s, what is a reasonable way of choosing our best guess of the parameters θ? The principle of maximum likelihood says that we should choose θ so as to make the data as high probability as possible; i.e., we should choose θ to maximize L(θ). Instead of maximizing L(θ), we can also maximize any strictly increasing function of L(θ); in particular, the derivations will be a bit simpler if we instead maximize the log likelihood ℓ(θ). Maximizing ℓ(θ) gives the same answer as minimizing (1/2) Σi (y(i) − θᵀx(i))², which we recognize to be J(θ), our original least-squares cost function. To summarize: under the previous probabilistic assumptions on the data, least-squares regression corresponds to finding the maximum likelihood estimate of θ.
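Writing the log likelihood out under the Gaussian noise assumption makes the connection to J(θ) explicit; the following display is the standard computation, filled in here for completeness:

```latex
\begin{aligned}
\ell(\theta) &= \log \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}
   \exp\!\left(-\frac{\big(y^{(i)}-\theta^{T}x^{(i)}\big)^{2}}{2\sigma^{2}}\right) \\
 &= n\log\frac{1}{\sqrt{2\pi}\,\sigma}
   \;-\; \frac{1}{\sigma^{2}}\cdot\frac{1}{2}\sum_{i=1}^{n}\big(y^{(i)}-\theta^{T}x^{(i)}\big)^{2}.
\end{aligned}
```

The first term does not depend on θ, so maximizing ℓ(θ) is exactly minimizing the second sum, i.e. J(θ); note also that the value of σ² does not affect the resulting choice of θ.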
Locally weighted linear regression

In this section, let us briefly talk about the locally weighted linear regression (LWR) algorithm which, assuming there is sufficient training data, makes the choice of features less critical. In the original linear regression algorithm, to make a prediction at a query point x (i.e., to evaluate h(x)), we would: 1. Fit θ to minimize Σi (y(i) − θᵀx(i))². 2. Output θᵀx. In contrast, the locally weighted linear regression algorithm does the following: 1. Fit θ to minimize Σi w(i)(y(i) − θᵀx(i))². 2. Output θᵀx.

Here, the w(i)'s are non-negative valued weights. Intuitively, if w(i) is large for a particular value of i, then in picking θ, we'll try hard to make (y(i) − θᵀx(i))² small. If w(i) is small, then the (y(i) − θᵀx(i))² error term will be pretty much ignored in the fit. A fairly standard choice for the weights is

    w(i) = exp(−(x(i) − x)² / (2τ²)).

(If x is vector-valued, this is generalized to w(i) = exp(−(x(i) − x)ᵀ(x(i) − x)/(2τ²)), or w(i) = exp(−(x(i) − x)ᵀΣ⁻¹(x(i) − x)/2), for an appropriate choice of τ or Σ.) Note that the weights depend on the particular point x at which we're trying to evaluate h. Moreover, if |x(i) − x| is small, then w(i) is close to 1; and if |x(i) − x| is large, then w(i) is small. Hence, θ is chosen giving a much higher "weight" to the (errors on) training examples close to the query point x. (Note also that while the formula for the weights takes a form that is cosmetically similar to the density of a Gaussian distribution, the w(i)'s do not directly have anything to do with Gaussians.) The parameter τ controls how quickly the weight of a training example falls off with the distance of its x(i) from the query point x; τ is called the bandwidth parameter, and is also something that you'll get to experiment with in your homework. (See also the extra credit problem on Q3 of problem set 1.)

Locally weighted linear regression is the first example we're seeing of a non-parametric algorithm. The (unweighted) linear regression algorithm that we saw earlier is known as a parametric learning algorithm, because it has a fixed, finite number of parameters (the θi's) which are fit to the data; once we've fit the θi's and stored them away, we no longer need to keep the training data around to make future predictions. In contrast, to make predictions using locally weighted linear regression, we need to keep the entire training set around. The term "non-parametric" (roughly) refers to the fact that the amount of stuff we need to keep in order to represent the hypothesis h grows linearly with the size of the training set.
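A compact sketch of a single LWR prediction, fitting θ by solving the weighted normal equations at the query point; the bandwidth value and the reuse of the toy housing data are illustrative assumptions.

```python
import numpy as np

X = np.column_stack([np.ones(5),
                     np.array([2104., 1600., 2400., 1416., 3000.])])
y = np.array([400., 330., 369., 232., 540.])

def lwr_predict(x_query, X, y, tau=300.0):
    """Locally weighted linear regression prediction at one query point.

    tau is the bandwidth parameter (illustrative value); the weight of each
    training example falls off with its distance from the query point.
    """
    d = X[:, 1] - x_query                      # distance in living area
    w = np.exp(-d**2 / (2.0 * tau**2))         # w(i) = exp(-(x(i) - x)^2 / (2 tau^2))
    W = np.diag(w)
    # Fit theta to minimize sum_i w(i) * (y(i) - theta^T x(i))^2.
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return np.array([1.0, x_query]) @ theta    # output theta^T x at the query

print(lwr_predict(2000.0, X, y))
```

Note that θ is re-fit for every query point, which is exactly why the training set has to be kept around.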
Part II: Classification and logistic regression

Let's now talk about the classification problem. This is just like the regression problem, except that the values y we now want to predict take on only a small number of discrete values. For now, we will focus on the binary classification problem in which y can take on only two values, 0 and 1. (Most of what we say here will also generalize to the multiple-class case.) For instance, if we are trying to build a spam classifier for email, then x(i) may be some features of a piece of email, and y may be 1 if it is a piece of spam mail, and 0 otherwise. 0 is also called the negative class, and 1 the positive class, and they are sometimes also denoted by the symbols "−" and "+". Given x(i), the corresponding y(i) is also called the label for the training example.

We could approach the classification problem ignoring the fact that y is discrete-valued, and use our old linear regression algorithm to try to predict y given x. However, it is easy to construct examples where this method performs very poorly. Intuitively, it also doesn't make sense for hθ(x) to take values larger than 1 or smaller than 0 when we know that y ∈ {0, 1}. To fix this, let's change the form of our hypotheses: logistic regression models p(y|x; θ) via hθ(x) = g(θᵀx), where g(z) = 1/(1 + e^(−z)) is called the logistic function or the sigmoid function and takes values in (0, 1).

So, given the logistic regression model, how do we fit θ for it? Following how we saw least squares regression could be derived as the maximum likelihood estimator under a set of assumptions, let's endow our classification model with a set of probabilistic assumptions, and then fit the parameters via maximum likelihood. Let us assume that P(y = 1 | x; θ) = hθ(x) and P(y = 0 | x; θ) = 1 − hθ(x). Note that this can be written more compactly as p(y | x; θ) = (hθ(x))^y (1 − hθ(x))^(1−y). Assuming that the n training examples were generated independently, we can then write down the likelihood of the parameters as L(θ) = p(~y | X; θ), the product of these per-example probabilities. As before, it will be easier to maximize the log likelihood ℓ(θ) = log L(θ).

How do we maximize the likelihood? Similar to our derivation in the case of linear regression, we can use gradient ascent. Written in vectorial notation, our updates will therefore be given by θ := θ + α∇θℓ(θ). (Note the positive rather than negative sign in the update formula, since we're maximizing, rather than minimizing, a function now.) Let's start by working with just one training example (x, y), and take derivatives to derive the stochastic gradient ascent rule; using the fact that g′(z) = g(z)(1 − g(z)), this gives

    θj := θj + α (y(i) − hθ(x(i))) xj(i).

If we compare this to the LMS update rule, we see that it looks identical; but this is not the same algorithm, because hθ(x(i)) is now defined as a non-linear function of θᵀx(i). Nonetheless, it's a little surprising that we end up with the same update rule for a rather different algorithm and learning problem. Is this coincidence, or is there a deeper reason behind this? We'll answer this question when we get to GLM models.
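A minimal sketch of logistic regression trained by batch gradient ascent on the log likelihood; the tiny synthetic dataset and the hyperparameters are illustrative and are not taken from the notes.

```python
import numpy as np

# Illustrative binary-classification data: one real feature plus an intercept.
X = np.column_stack([np.ones(6),
                     np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])])
y = np.array([0., 0., 1., 0., 1., 1.])

def g(z):
    """The sigmoid (logistic) function."""
    return 1.0 / (1.0 + np.exp(-z))

theta, alpha = np.zeros(2), 0.1
for _ in range(5_000):
    # Gradient ascent on l(theta): the j-th gradient component is
    # sum_i (y(i) - h_theta(x(i))) * x_j(i), so the update mirrors LMS.
    theta += alpha * (y - g(X @ theta)) @ X

print(theta)           # fitted parameters
print(g(X @ theta))    # predicted probabilities on the training inputs
```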
Digression: the perceptron learning algorithm

We now digress to talk briefly about an algorithm that's of some historical interest, and that we will also return to later when we talk about learning theory. Consider modifying the logistic regression method to "force" it to output values that are either 0 or 1 exactly. To do so, it seems natural to change the definition of g to be the threshold function: g(z) = 1 if z ≥ 0, and g(z) = 0 otherwise. If we then let hθ(x) = g(θᵀx) as before but using this modified definition of g, and if we use the update rule θj := θj + α (y(i) − hθ(x(i))) xj(i), then we have the perceptron learning algorithm.

Another algorithm for maximizing ℓ(θ): Newton's method

Returning to logistic regression, let's now talk about a different algorithm for maximizing ℓ(θ). Newton's method gives a way of finding a value of θ for which f(θ) = 0, for a function f : R -> R, by repeatedly performing the update θ := θ − f(θ)/f′(θ). In the figures accompanying this discussion in the original notes, one iteration of Newton's method updates θ to about 1.8, and after a few more iterations we rapidly approach θ = 1.3, a zero of f. What if we want to use it to maximize some function ℓ? The maxima of ℓ correspond to points where its first derivative ℓ′(θ) is zero. So, by letting f(θ) = ℓ′(θ), we can use the same algorithm to maximize ℓ, and we obtain the update rule θ := θ − ℓ′(θ)/ℓ″(θ). (Something to think about: how would this change if we wanted to use Newton's method to minimize rather than maximize a function?)

Lastly, in our logistic regression setting, θ is vector-valued, so we need to generalize Newton's method to this setting. The generalization of Newton's method to this multidimensional setting (also called the Newton-Raphson method) is given by

    θ := θ − H⁻¹ ∇θℓ(θ).

Here, ∇θℓ(θ) is, as usual, the vector of partial derivatives of ℓ(θ) with respect to the θj's, and H is a d-by-d matrix (actually, d+1-by-d+1, assuming that we include the intercept term) called the Hessian, whose entries are given by Hij = ∂²ℓ(θ)/∂θi∂θj. Newton's method typically enjoys faster convergence than (batch) gradient descent, and requires many fewer iterations to get very close to the minimum. One iteration of Newton's method can, however, be more expensive than one iteration of gradient descent, since it requires finding and inverting a d-by-d Hessian; but so long as d is not too large, it is usually much faster overall. When Newton's method is applied to maximize the logistic regression log likelihood function ℓ(θ), the resulting method is also called Fisher scoring.
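For comparison, here is a sketch of the Newton-Raphson update applied to the same illustrative logistic regression problem; the gradient and Hessian expressions in the comments follow from the log likelihood stated earlier.

```python
import numpy as np

X = np.column_stack([np.ones(6),
                     np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])])
y = np.array([0., 0., 1., 0., 1., 1.])

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.zeros(2)
for _ in range(10):                       # Newton's method needs very few iterations
    p = g(X @ theta)
    grad = X.T @ (y - p)                  # gradient of the log likelihood
    H = -(X.T * (p * (1 - p))) @ X        # Hessian: -X^T diag(p(1-p)) X
    theta -= np.linalg.solve(H, grad)     # theta := theta - H^{-1} grad

print(theta)
```

On this small problem the ten Newton steps reach essentially the same parameters as the several thousand gradient ascent steps above, which illustrates the faster convergence discussed in the text.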
Part III: Generalized Linear Models

So far, we've seen a regression example and a classification example. In the regression example, we had y | x; θ ∼ N(μ, σ²), and in the classification one, y | x; θ ∼ Bernoulli(φ), for some appropriate definitions of μ and φ as functions of x and θ. In this section, we will show that both of these methods are special cases of a broader family of models, called Generalized Linear Models (GLMs). We will also show how other models in the GLM family can be derived and applied to other classification and regression problems.

The exponential family

To work our way up to GLMs, we will begin by defining exponential family distributions. We say that a class of distributions is in the exponential family if it can be written in the form

    p(y; η) = b(y) exp(ηᵀT(y) − a(η)).                    (3)

Here, η is called the natural parameter (also called the canonical parameter) of the distribution; T(y) is the sufficient statistic (for the distributions we consider, it will often be the case that T(y) = y); and a(η) is the log partition function. The quantity e^(−a(η)) essentially plays the role of a normalization constant that makes sure the distribution p(y; η) sums/integrates over y to 1. A fixed choice of T, a and b defines a family (or set) of distributions that is parameterized by η; as we vary η, we then get different distributions within this family.

We now show that the Bernoulli and the Gaussian distributions are examples of exponential family distributions. The Bernoulli distribution with mean φ, written Bernoulli(φ), specifies a distribution over y ∈ {0, 1}, so that p(y = 1; φ) = φ and p(y = 0; φ) = 1 − φ. As we vary φ, we obtain Bernoulli distributions with different means. We now show that this class of Bernoulli distributions, obtained by varying φ, is in the exponential family; i.e., that there is a choice of T, a and b so that Equation (3) becomes exactly the class of Bernoulli distributions.
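The standard way to exhibit such a choice of T, a, and b (written out here for completeness) is to rewrite the Bernoulli probability mass function as the exponential of a logarithm:

```latex
\begin{aligned}
p(y;\phi) &= \phi^{\,y}(1-\phi)^{1-y}
  = \exp\!\big(y\log\phi + (1-y)\log(1-\phi)\big) \\
  &= \exp\!\Big(\Big(\log\tfrac{\phi}{1-\phi}\Big)\, y \;+\; \log(1-\phi)\Big).
\end{aligned}
```

Thus the natural parameter is η = log(φ/(1 − φ)), with T(y) = y, a(η) = −log(1 − φ) = log(1 + e^η), and b(y) = 1. Inverting η gives φ = 1/(1 + e^(−η)), the sigmoid function, which is one hint at the deeper reason, promised earlier, for why the sigmoid shows up in logistic regression.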
A similar calculation shows that the Gaussian distribution is also a member of the exponential family. Starting from an exponential family distribution for y given x and a few modeling assumptions, the GLM construction recovers least-squares regression in the Gaussian case and logistic regression in the Bernoulli case, and other models in the GLM family can be derived and applied to other classification and regression problems in the same way. (For more on GLMs, see McCullagh and Nelder, Generalized Linear Models (2nd ed.).)
Beyond these notes, the course covers a number of further topics (later lectures include weak supervision and unsupervised learning, advice on applying machine learning, and the midterm). The notes on generative learning algorithms contrast discriminative algorithms (for instance, logistic regression, which modeled p(y|x; θ) directly as hθ(x) = g(θᵀx) with g the sigmoid function) with generative learning algorithms such as Gaussian discriminant analysis. The kernel methods notes return to the problem of predicting the price of a house (denoted by y) from its living area (denoted by x), where we fit a linear function of x to the training data, and show how to move beyond linear functions via feature maps. Another set of notes presents support vector machines. The deep learning notes begin our study of deep learning: they give an overview of neural networks, discuss vectorization, and discuss training neural networks with backpropagation, starting small and slowly building up a neural network step by step. On the unsupervised side, the mixtures-of-Gaussians notes introduce the EM (Expectation-Maximization) algorithm for density estimation, and a later set of notes gives a broader view of the EM algorithm, showing how it can be applied to a large family of estimation problems with latent variables. Finally, the k-means notes consider the clustering problem: we are given a training set {x(1), ..., x(m)} and want to group the data into a few cohesive "clusters"; here x(i) ∈ R^n as usual, but no labels y(i) are given, so this is an unsupervised learning problem (a minimal sketch of the algorithm follows below).
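Since the clustering paragraph above only states the problem, here is a minimal k-means sketch (the standard alternation between assignment and centroid-update steps); the synthetic two-blob data, the choice k = 2, and the fixed iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative data: two well-separated 2-D blobs.
x = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
               rng.normal(3.0, 0.5, size=(20, 2))])

k = 2
centroids = x[rng.choice(len(x), size=k, replace=False)]  # init at random examples
for _ in range(20):
    # Assignment step: send each point to its nearest centroid.
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: move each centroid to the mean of its assigned points.
    centroids = np.array([x[labels == j].mean(axis=0) for j in range(k)])

print(centroids)
```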

