27. Reading and Writing Data in Pandas
By Bernd Klein. Last modified: 01 February 2022.
All the powerful data structures like the Series and the DataFrames would be of little use if the Pandas module didn't provide powerful functionality for reading in and writing out data. It is not only a matter of having functions for interacting with files. To be useful to data scientists, it also needs functions which support the most important data formats, like:
- Delimiter-separated files, like e.g. csv
- Microsoft Excel files
- HTML
- XML
- JSON
Delimiter-separated Values
Most people take csv files as a synonym for delimiter-separated values files. They leave out of account the fact that csv is an acronym for "comma separated values", which is not the case in many situations. Pandas also uses "csv" in contexts in which "dsv" would be more appropriate.
Delimiter-separated values (DSV) define and store two-dimensional arrays (for example of strings) of data by separating the values in each row with delimiter characters defined for this purpose. This way of storing data is often used in combination with spreadsheet programs, which can read in and write out data as DSV. DSV files are also used as a general data exchange format.
We call a text file a "delimited text file" if it contains text in DSV format.
For example, the file dollar_euro.txt is a delimited text file and uses tabs (\t) as delimiters.
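To make the format concrete, here is a small sketch (not part of the original data set) that writes a few tab-separated lines to a hypothetical file small_rates.txt and reads them back; the file name and the two data rows are purely illustrative:

# Illustrative only: create and inspect a tiny tab-delimited text file.
rows = [("Year", "Average"), ("2016", "0.901696"), ("2015", "0.901896")]

with open("small_rates.txt", "w") as fh:      # hypothetical file name
    for row in rows:
        fh.write("\t".join(row) + "\n")       # the tab is the delimiter

with open("small_rates.txt") as fh:
    print(fh.read())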
Reading CSV and DSV Files
To be precise, Pandas offers two ways to read in CSV or DSV files:
- DataFrame.from_csv
- read_csv
There is no big difference between those two functions, e.g. they have different default values in some cases and read_csv has more parameters. We will focus on read_csv, because DataFrame.from_csv is kept inside Pandas only for reasons of backwards compatibility.
import pandas as pd

exchange_rates = pd.read_csv("/data1/dollar_euro.txt", sep="\t")
print(exchange_rates)
OUTPUT:
    Year   Average  Min USD/EUR  Max USD/EUR  Working days
0   2016  0.901696     0.864379     0.959785           247
1   2015  0.901896     0.830358     0.947688           256
2   2014  0.753941     0.716692     0.823655           255
3   2013  0.753234     0.723903     0.783208           255
4   2012  0.778848     0.743273     0.827198           256
5   2011  0.719219     0.671953     0.775855           257
6   2010  0.755883     0.686672     0.837381           258
7   2009  0.718968     0.661376     0.796495           256
8   2008  0.683499     0.625391     0.802568           256
9   2007  0.730754     0.672314     0.775615           255
10  2006  0.797153     0.750131     0.845594           255
11  2005  0.805097     0.740357     0.857118           257
12  2004  0.804828     0.733514     0.847314           259
13  2003  0.885766     0.791766     0.963670           255
14  2002  1.060945     0.953562     1.165773           255
15  2001  1.117587     1.047669     1.192748           255
16  2000  1.085899     0.962649     1.211827           255
17  1999  0.939475     0.848176     0.998502           261
As we can see, read_csv automatically used the first line as the names for the columns. It is possible to give other names to the columns. For this purpose, we have to skip the first line by setting the parameter "header" to 0 and we have to assign a list with the column names to the parameter "names":
import pandas as pd

exchange_rates = pd.read_csv("/data1/dollar_euro.txt", sep="\t",
                             header=0,
                             names=["year", "min", "max", "days"])
print(exchange_rates)
OUTPUT:
          year       min       max  days
2016  0.901696  0.864379  0.959785   247
2015  0.901896  0.830358  0.947688   256
2014  0.753941  0.716692  0.823655   255
2013  0.753234  0.723903  0.783208   255
2012  0.778848  0.743273  0.827198   256
2011  0.719219  0.671953  0.775855   257
2010  0.755883  0.686672  0.837381   258
2009  0.718968  0.661376  0.796495   256
2008  0.683499  0.625391  0.802568   256
2007  0.730754  0.672314  0.775615   255
2006  0.797153  0.750131  0.845594   255
2005  0.805097  0.740357  0.857118   257
2004  0.804828  0.733514  0.847314   259
2003  0.885766  0.791766  0.963670   255
2002  1.060945  0.953562  1.165773   255
2001  1.117587  1.047669  1.192748   255
2000  1.085899  0.962649  1.211827   255
1999  0.939475  0.848176  0.998502   261
Exercise 1
The file "countries_population.csv" is a csv file, containing the population numbers of all countries (July 2014). The delimiter of the file is a space and commas are used to split groups of thousands in the numbers. The method 'caput(n)' of a DataFrame can be used to give out only the showtime n rows or lines. Read the file into a DataFrame.
Solution:
pop = pd.read_csv("/data1/countries_population.csv",
                  header=None,
                  names=["Country", "Population"],
                  index_col=0,
                  quotechar="'",
                  sep=" ",
                  thousands=",")
print(pop.head(5))
OUTPUT:
                Population
Country
China           1355692576
India           1236344631
European Union   511434812
United States    318892103
Indonesia        253609643
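The parameters quotechar and thousands do the real work in this solution: country names such as 'United States' are enclosed in single quotes, and the population numbers contain commas as thousands separators. The following sketch, which is not from the original article and only uses an inline string via io.StringIO, shows the same parsing without needing the data file:

from io import StringIO
import pandas as pd

# Two sample lines in the same format as countries_population.csv.
data = "'United States' 318,892,103\n'Indonesia' 253,609,643\n"

sample = pd.read_csv(StringIO(data),
                     header=None,
                     names=["Country", "Population"],
                     index_col=0,
                     quotechar="'",
                     sep=" ",
                     thousands=",")
print(sample)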
Writing csv Files
We can create csv (or dsv) files with the method "to_csv". Before we do this, we will prepare some data to output, which we will write to a file. We have two csv files with population data for various countries. countries_male_population.csv contains the figures of the male populations and countries_female_population.csv correspondingly the numbers for the female populations. We will create a new csv file with the sum:
column_names = ["Country"] + list(range(2002, 2013))

male_pop = pd.read_csv("/data1/countries_male_population.csv",
                       header=None,
                       index_col=0,
                       names=column_names)
female_pop = pd.read_csv("/data1/countries_female_population.csv",
                         header=None,
                         index_col=0,
                         names=column_names)

population = male_pop + female_pop
population
Country | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 |
---|---|---|---|---|---|---|---|---|---|---|---|
Australia | 19640979.0 | 19872646 | 20091504 | 20339759 | 20605488 | 21015042 | 21431781 | 21874920 | 22342398 | 22620554 | 22683573 |
Austria | 8139310.0 | 8067289 | 8140122 | 8206524 | 8265925 | 8298923 | 8331930 | 8355260 | 8375290 | 8404252 | 8443018 |
Belgium | 10309725.0 | 10355844 | 10396421 | 10445852 | 10511382 | 10584534 | 10666866 | 10753080 | 10839905 | 10366843 | 11035958 |
Canada | NaN | 31361611 | 31372587 | 31989454 | 32299496 | 32649482 | 32927372 | 33327337 | 33334414 | 33927935 | 34492645 |
Czech Republic | 10269726.0 | 10203269 | 10211455 | 10220577 | 10251079 | 10287189 | 10381130 | 10467542 | 10506813 | 10532770 | 10505445 |
Denmark | 5368354.0 | 5383507 | 5397640 | 5411405 | 5427459 | 5447084 | 5475791 | 5511451 | 5534738 | 5560628 | 5580516 |
Finland | 5194901.0 | 5206295 | 5219732 | 5236611 | 5255580 | 5276955 | 5300484 | 5326314 | 5351427 | 5375276 | 5401267 |
France | 59337731.0 | 59630121 | 59900680 | 62518571 | 62998773 | 63392140 | 63753140 | 64366962 | 64716310 | 65129746 | 65394283 |
Germany | 82440309.0 | 82536680 | 82531671 | 82500849 | 82437995 | 82314906 | 82217837 | 82002356 | 81802257 | 81751602 | 81843743 |
Greece | 10988000.0 | 11006377 | 11040650 | 11082751 | 11125179 | 11171740 | 11213785 | 11260402 | 11305118 | 11309885 | 11290067 |
Hungary | 10174853.0 | 10142362 | 10116742 | 10097549 | 10076581 | 10066158 | 10045401 | 10030975 | 10014324 | 9985722 | 9957731 |
Iceland | 286575.0 | 288471 | 290570 | 293577 | 299891 | 307672 | 315459 | 319368 | 317630 | 318452 | 319575 |
Ireland | 3882683.0 | 3963636 | 4027732 | 4109173 | 4209019 | 4239848 | 4401335 | 4450030 | 4467854 | 4569864 | 4582769 |
Italy | 56993742.0 | 57321070 | 57888245 | 58462375 | 58751711 | 59131287 | 59619290 | 60045068 | 60340328 | 60626442 | 60820696 |
Japan | 127291000.0 | 127435000 | 127620000 | 127687000 | 127767994 | 127770000 | 127771000 | 127692000 | 127510000 | 128057000 | 127799000 |
Korea | 47639618.0 | 47925318 | 48082163 | 48138077 | 48297184 | 48456369 | 48606787 | 48746693 | 48874539 | 49779440 | 50004441 |
Luxembourg | 444050.0 | 448300 | 451600 | 455000 | 469086 | 476187 | 483799 | 493500 | 502066 | 511840 | 524853 |
Mexico | 101826249.0 | 103039964 | 104213503 | 103001871 | 103946866 | 104874282 | 105790725 | 106682518 | 107550697 | 108396211 | 115682867 |
Netherlands | 16105285.0 | 16192572 | 16258032 | 16305526 | 16334210 | 16357992 | 16405399 | 16485787 | 16574989 | 16655799 | 16730348 |
New Zealand | 3939130.0 | 4009200 | 4062500 | 4100570 | 4139470 | 4228280 | 4268880 | 4315840 | 4367740 | 4405150 | 4433100 |
Norway | 4524066.0 | 4552252 | 4577457 | 4606363 | 4640219 | 4681134 | 4737171 | 4799252 | 4858199 | 4920305 | 4985870 |
Poland | 38632453.0 | 38218531 | 38190608 | 38173835 | 38157055 | 38125479 | 38115641 | 38135876 | 38167329 | 38200037 | 38538447 |
Portugal | 10335559.0 | 10407465 | 10474685 | 10529255 | 10569592 | 10599095 | 10617575 | 10627250 | 10637713 | 10636979 | 10542398 |
Slovak Republic | 5378951.0 | 5379161 | 5380053 | 5384822 | 5389180 | 5393637 | 5400998 | 5412254 | 5424925 | 5435273 | 5404322 |
Spain | 40409330.0 | 41550584 | 42345342 | 43038035 | 43758250 | 44474631 | 45283259 | 45828172 | 45989016 | 46152926 | 46818221 |
Sweden | 8909128.0 | 8940788 | 8975670 | 9011392 | 9047752 | 9113257 | 9182927 | 9256347 | 9340682 | 9415570 | 9482855 |
Switzerland | 7261210.0 | 7313853 | 7364148 | 7415102 | 7459128 | 7508739 | 7593494 | 7701856 | 7785806 | 7870134 | 7954662 |
Turkey | NaN | 70171979 | 70689500 | 71607500 | 72519974 | 72519974 | 70586256 | 71517100 | 72561312 | 73722988 | 74724269 |
United Kingdom | 58706905.0 | 59262057 | 59699828 | 60059858 | 60412870 | 60781346 | 61179260 | 61595094 | 62026962 | 62498612 | 63256154 |
United States | 277244916.0 | 288774226 | 290810719 | 294442683 | 297308143 | 300184434 | 304846731 | 305127551 | 307756577 | 309989078 | 312232049 |
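Note that the + operator returns NaN wherever a value is missing in one of the two DataFrames, which is why Canada and Turkey show NaN for 2002. If missing figures should be treated as 0 instead, the add method with a fill_value argument can be used; the following line is only a sketch of that alternative, not what the text above does:

# Alternative: treat a missing male or female figure as 0 instead of producing NaN.
population_filled = male_pop.add(female_pop, fill_value=0)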
population.to_csv("/data1/countries_total_population.csv")
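By default, to_csv writes a comma-separated file and includes the index (the country names). Other delimiters and number formats can be chosen via parameters; the following sketch writes a tab-separated variant under an assumed file name:

# Tab-separated output, keeping the country index and rounding the floats.
population.to_csv("/data1/countries_total_population.tsv",   # assumed file name
                  sep="\t",
                  float_format="%.0f")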
We want to create a new DataFrame with all the data, i.e. female, male and complete population. This means that we have to introduce a hierarchical index. Before we do it on our DataFrame, we will introduce this problem in a simple example:
import pandas as pd

shop1 = {"foo": {2010: 23, 2011: 25}, "bar": {2010: 13, 2011: 29}}
shop2 = {"foo": {2010: 223, 2011: 225}, "bar": {2010: 213, 2011: 229}}

shop1 = pd.DataFrame(shop1)
shop2 = pd.DataFrame(shop2)

both_shops = shop1 + shop2
print("Sales of shop1:\n", shop1)
print("\nSales of both shops\n", both_shops)
OUTPUT:
Sales of shop1:
       foo  bar
2010   23   13
2011   25   29

 Sales of both shops
       foo  bar
2010  246  226
2011  250  258
shops = pd.concat([shop1, shop2], keys=["one", "two"])
shops
shop | year | foo | bar |
---|---|---|---|
one | 2010 | 23 | 13 |
one | 2011 | 25 | 29 |
two | 2010 | 223 | 213 |
two | 2011 | 225 | 229 |
We want to swap the hierarchical indices. For this we will use 'swaplevel':
shops = shops.swaplevel()
shops.sort_index(inplace=True)
shops
year | shop | foo | bar |
---|---|---|---|
2010 | one | 23 | 13 |
2010 | two | 223 | 213 |
2011 | one | 25 | 29 |
2011 | two | 225 | 229 |
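With the swapped and sorted index the outer level is now the year, so a whole year or a single cell can be selected directly. A short sketch of such selections (not part of the original text):

# Select both shops for 2010 (outer index level) ...
print(shops.loc[2010])
# ... and a single value via a (year, shop) tuple.
print(shops.loc[(2010, "one"), "foo"])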
We will go back to our initial problem with the population figures. We will apply the same steps to those DataFrames:
pop_complete = pd.concat([population.T, male_pop.T, female_pop.T],
                         keys=["total", "male", "female"])

df = pop_complete.swaplevel()
df.sort_index(inplace=True)
df[["Austria", "Australia", "France"]]
year | | Austria | Australia | France |
---|---|---|---|---|
2002 | female | 4179743.0 | 9887846.0 | 30510073.0 |
2002 | male | 3959567.0 | 9753133.0 | 28827658.0 |
2002 | total | 8139310.0 | 19640979.0 | 59337731.0 |
2003 | female | 4158169.0 | 9999199.0 | 30655533.0 |
2003 | male | 3909120.0 | 9873447.0 | 28974588.0 |
2003 | total | 8067289.0 | 19872646.0 | 59630121.0 |
2004 | female | 4190297.0 | 10100991.0 | 30789154.0 |
2004 | male | 3949825.0 | 9990513.0 | 29111526.0 |
2004 | total | 8140122.0 | 20091504.0 | 59900680.0 |
2005 | female | 4220228.0 | 10218321.0 | 32147490.0 |
2005 | male | 3986296.0 | 10121438.0 | 30371081.0 |
2005 | total | 8206524.0 | 20339759.0 | 62518571.0 |
2006 | female | 4246571.0 | 10348070.0 | 32390087.0 |
2006 | male | 4019354.0 | 10257418.0 | 30608686.0 |
2006 | total | 8265925.0 | 20605488.0 | 62998773.0 |
2007 | female | 4261752.0 | 10570420.0 | 32587979.0 |
2007 | male | 4037171.0 | 10444622.0 | 30804161.0 |
2007 | total | 8298923.0 | 21015042.0 | 63392140.0 |
2008 | female | 4277716.0 | 10770864.0 | 32770860.0 |
2008 | male | 4054214.0 | 10660917.0 | 30982280.0 |
2008 | total | 8331930.0 | 21431781.0 | 63753140.0 |
2009 | female | 4287213.0 | 10986535.0 | 33208315.0 |
2009 | male | 4068047.0 | 10888385.0 | 31158647.0 |
2009 | total | 8355260.0 | 21874920.0 | 64366962.0 |
2010 | female | 4296197.0 | 11218144.0 | 33384930.0 |
2010 | male | 4079093.0 | 11124254.0 | 31331380.0 |
2010 | total | 8375290.0 | 22342398.0 | 64716310.0 |
2011 | female | 4308915.0 | 11359807.0 | 33598633.0 |
2011 | male | 4095337.0 | 11260747.0 | 31531113.0 |
2011 | total | 8404252.0 | 22620554.0 | 65129746.0 |
2012 | female | 4324983.0 | 11402769.0 | 33723892.0 |
2012 | male | 4118035.0 | 11280804.0 | 31670391.0 |
2012 | total | 8443018.0 | 22683573.0 | 65394283.0 |
df.to_csv("/data1/countries_total_population.csv")
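The file written here contains two index columns, the year and the 'female'/'male'/'total' label. When reading it back, read_csv can rebuild the hierarchical index from those columns; a sketch under the assumption that the file was written exactly as above:

# Rebuild the MultiIndex from the first two columns of the csv file.
df_restored = pd.read_csv("/data1/countries_total_population.csv",
                          index_col=[0, 1])
print(df_restored[["Austria", "Australia", "France"]].head(6))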
Exercise 2
- Read in the dsv file (csv) bundeslaender.txt. Create a new file with the columns 'land', 'area', 'female', 'male', 'population' and 'density' (inhabitants per square kilometre).
- Print out the rows where the area is greater than 30000 and the population is greater than 10000.
- Print the rows where the density is greater than 300.
lands = pd.read_csv('/data1/bundeslaender.txt', sep=" ")
print(lands.columns.values)
OUTPUT:
['land' 'area' 'male' 'female']
# swap the columns of our DataFrame:
lands = lands.reindex(columns=['land', 'area', 'female', 'male'])
lands[:2]
| | land | area | female | male |
---|---|---|---|---|
0 | Baden-Württemberg | 35751.65 | 5465 | 5271 |
1 | Bayern | 70551.57 | 6366 | 6103 |
lands.insert(loc=len(lands.columns),
             column='population',
             value=lands['female'] + lands['male'])
lands[:3]
| | land | area | female | male | population |
---|---|---|---|---|---|
0 | Baden-Württemberg | 35751.65 | 5465 | 5271 | 10736 |
1 | Bayern | 70551.57 | 6366 | 6103 | 12469 |
2 | Berlin | 891.85 | 1736 | 1660 | 3396 |
lands.insert(loc=len(lands.columns),
             column='density',
             value=(lands['population'] * 1000 / lands['area']).round(0))
lands[:4]
| | land | area | female | male | population | density |
---|---|---|---|---|---|---|
0 | Baden-Württemberg | 35751.65 | 5465 | 5271 | 10736 | 300.0 |
1 | Bayern | 70551.57 | 6366 | 6103 | 12469 | 177.0 |
2 | Berlin | 891.85 | 1736 | 1660 | 3396 | 3808.0 |
3 | Brandenburg | 29478.61 | 1293 | 1267 | 2560 | 87.0 |
print(lands.loc[(lands.area > 30000) & (lands.population > 10000)])
OUTPUT:
                  land      area  female  male  population  density
0    Baden-Württemberg  35751.65    5465  5271       10736    300.0
1               Bayern  70551.57    6366  6103       12469    177.0
9  Nordrhein-Westfalen  34085.29    9261  8797       18058    530.0
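The exercise also asks for a new file with the extended columns and for the rows with a density above 300. As a sketch (the output file name is only an assumption), both remaining steps can be finished like this:

# Write the extended DataFrame to a new dsv file (assumed file name).
lands.to_csv("/data1/bundeslaender_density.txt", sep=" ", index=False)

# Rows where the density is greater than 300.
print(lands.loc[lands.density > 300])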
Reading and Writing Excel Files
It is also possible to read and write Microsoft Excel files. The Pandas functionalities to read and write Excel files use the modules 'xlrd' and 'openpyxl'. These modules are not automatically installed by Pandas, so you may have to install them manually!
We will use a simple Excel document to demonstrate the reading capabilities of Pandas. The document sales.xls contains two sheets, one called 'week1' and the other one 'week2'.
An Excel file can be read in with the Pandas function "read_excel". This is demonstrated in the following example Python code:
excel_file = pd.ExcelFile("/data1/sales.xls")
sheet = pd.read_excel(excel_file)
sheet
| | Weekday | Sales |
---|---|---|
0 | Monday | 123432.980000 |
1 | Tuesday | 122198.650200 |
2 | Wednesday | 134418.515220 |
3 | Thursday | 131730.144916 |
4 | Friday | 128173.431003 |
The document "sales.xls" contains 2 sheets, but we just have been able to read in the get-go one with "read_excel". A complete Excel document, which tin consist of an arbitrary number of sheets, can exist completely read in similar this:
docu = {}
for sheet_name in excel_file.sheet_names:
    docu[sheet_name] = excel_file.parse(sheet_name)

for sheet_name in docu:
    print("\n" + sheet_name + ":\n", docu[sheet_name])
OUTPUT:
week1:
      Weekday          Sales
0      Monday  123432.980000
1     Tuesday  122198.650200
2   Wednesday  134418.515220
3    Thursday  131730.144916
4      Friday  128173.431003

week2:
      Weekday          Sales
0      Monday  223277.980000
1     Tuesday  234441.879000
2   Wednesday  246163.972950
3    Thursday  241240.693491
4      Friday  230143.621590
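As an alternative to looping over excel_file.sheet_names, read_excel can read every sheet in one call: passing sheet_name=None returns a dictionary that maps sheet names to DataFrames. A short sketch of this variant:

# Read all sheets at once; the result is a dict of DataFrames.
all_sheets = pd.read_excel("/data1/sales.xls", sheet_name=None)
for name, frame in all_sheets.items():
    print("\n" + name + ":\n", frame)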
We will now calculate the average sales numbers of the two weeks:
average = docu["week1"].copy()
average["Sales"] = (docu["week1"]["Sales"] + docu["week2"]["Sales"]) / 2
print(average)
OUTPUT:
     Weekday          Sales
0     Monday  173355.480000
1    Tuesday  178320.264600
2  Wednesday  190291.244085
3   Thursday  186485.419203
4     Friday  179158.526297
We will save the DataFrame 'average' in a new document with 'week1' and 'week2' as additional sheets as well:
writer = pd.ExcelWriter('/data1/sales_average.xlsx')
docu['week1'].to_excel(writer, sheet_name='week1')
docu['week2'].to_excel(writer, sheet_name='week2')
average.to_excel(writer, sheet_name='average')
writer.save()
writer.close()
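In newer pandas versions writer.save() is deprecated; the usual pattern is to use ExcelWriter as a context manager, which saves and closes the file automatically. The same export could then be written as the following sketch:

# Equivalent export with a context manager; no explicit save()/close() needed.
with pd.ExcelWriter('/data1/sales_average.xlsx') as writer:
    docu['week1'].to_excel(writer, sheet_name='week1')
    docu['week2'].to_excel(writer, sheet_name='week2')
    average.to_excel(writer, sheet_name='average')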
Source: https://python-course.eu/numerical-programming/reading-and-writing-data-in-pandas.php