Life with SPSS and other similar things

Creating new codes from one meta code

2016-07-30T00:15:00.001-07:00

This is another life-saving piece of syntax that I learnt from Minhaj. Some datasets come with a single large code that packs in several pieces of information. For example, it is common to have a PSU code that includes information about the respondent's location, sex, etc.

Here is a syntax that can be modified.

SET OVars Both ONumbers Both TVars Both TNumbers Both.
SET OVars Labels ONumbers Labels TVars Labels TNumbers Labels.

comp province = trunc (psu/100000).
comp temp = trunc (psu/10000).
*comp locality = temp - (province*10).
var lab province 'Province'.
val lab province 1'Punjab' 2'Sindh' 3'NWFP' 4'Balochistan'.

format province (f1.0).

fre province.

The point to note in this syntax is that the variable "psu" has 6 digits and the code for "province" is stored in position 1. Hence, it is divided by 100000, which after truncation drops the last 5 digits.
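
The same digit arithmetic can be tried outside SPSS. Here is a minimal Python sketch, assuming a made-up six-digit PSU code in which the first digit is the province and the second digit is the locality (the real layout depends on the survey, so check the codebook):

```python
# Hypothetical six-digit PSU code; the digit layout is an assumption for illustration.
psu = 312345

# Integer division drops the trailing digits, like TRUNC(psu/100000) in SPSS.
province = psu // 100000          # first digit
temp = psu // 10000               # first two digits
locality = temp - province * 10   # second digit on its own

print(province, temp, locality)
```

The divisor is always 10 raised to the number of digits you want to drop, which is why a 6-digit code needs 100000 to isolate the first digit.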

More sources of social science data

2016-02-03T04:46:00.004-08:00

Here is something worth exploring to see what social science data is available.

Github Social Science data.

"LIST" command to display the data

2016-01-31T21:39:00.001-08:00

Yesterday I was at a loss, as I could not remember the "LIST" command to show or display the data "as is" rather than in aggregated form. I searched the internet and racked my brain for hours, all in vain. But thanks to Muhammad Ali, who sometimes helps with data entry, I was relieved of this torture this morning.

It is a really simple command, and that is why it was difficult to find through a search. So, here is the command to show the data "as it is" in the data file.

LIST [variable name(s)].

This command is used a lot for data cleaning and for looking at string variables where you don't expect to get phrases/sentences/words that match with each other. This is also useful to list all the responses of one respondent side by side.

Weighting in SPSS

2014-12-20T05:06:00.003-08:00

I learnt this from Minhaj.

It is common in surveys to over- or undersample and to use weights later to compensate for such decisions. For example, in Pakistan the largest province by population is undersampled while the smallest one, Balochistan, is oversampled. In such cases it is critical for data analysis that the cases are "weighted."

In the data files I have been using the weights are included in a variable which can be applied to use this technique. In the following example, the weight variable is labelled as "weight."

If you just multiply the cases by weight
WEIGHT BY weight.

The proportions in your data analysis come out right but not the Ns, which means you need not just to apply the weight but also to rescale the weight variable so that the weighted total equals the actual number of cases (it seems baffling until you do it). In order to do that, create a new weight variable; let's call it "whi." For that you must divide "weight" by the SUM of the weight variable over all the cases and multiply by the total number of cases. In this case the SUM=13557999999 and total cases=13558.

comp whi = (weight/13557999999) * 13558.
format whi (f16.11).
var lab whi "Reduced weighted N to 13558".

Now WEIGHT your file again, with the new variable. The percentages and the Ns, both will come out right.

weight BY whi.

To turn off weight, use the following,

WEIGHT OFF.
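
To see why the rescaled weight fixes the Ns, the step can be sketched in Python. This is only an illustration with made-up weights for five respondents, not the actual survey file:

```python
# Hypothetical raw weights for five respondents (made-up numbers).
weights = [1200.0, 800.0, 1500.0, 500.0, 1000.0]

n_cases = len(weights)
total_weight = sum(weights)

# Same formula as: comp whi = (weight / SUM) * N.
whi = [w / total_weight * n_cases for w in weights]

print(whi)
print(sum(whi))
```

The rescaled weights keep the same proportions to each other as the raw ones, but they now add up to the number of cases, so the weighted N matches the real sample size.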

Creating dummy variables (from UCLA site)

2013-05-01T02:21:00.001-07:00

* make dummies, method 1 .
COMPUTE race1=(race=1).
COMPUTE race2=(race=2).
COMPUTE race3=(race=3).
crosstabs /tables = race by race1 
          /tables = race by race2 
          /tables = race by race3.

* make dummies, method 2 .
DO REPEAT A=race1 race2 race3 /B=1 2 3.
COMPUTE A=(race=B).
END REPEAT.
crosstabs /tables = race by race1 
          /tables = race by race2 
          /tables = race by race3.

http://www.ats.ucla.edu/stat/spss/code/dummy.htm
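
For comparison, the two methods can be sketched in Python. The race codes below are made up, and missing values are not handled in this sketch:

```python
# Hypothetical race codes for six respondents.
race = [1, 2, 3, 1, 2, 3]

# Method 1: one comparison per dummy, like COMPUTE race1=(race=1).
race1 = [int(r == 1) for r in race]

# Method 2: loop over the codes, like DO REPEAT ... END REPEAT.
dummies = {f"race{code}": [int(r == code) for r in race] for code in (1, 2, 3)}

print(race1)
print(dummies)
```

Both methods give the same result; the loop version just saves typing when there are many categories.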

PSPP another data analysis software

2013-04-11T10:55:00.002-07:00

You can download it for free here.

to explore more: Create a pivot table using Python

2013-03-28T20:16:00.001-07:00

# Create a pivot table
# (excerpt: groupIndex, sumIndex, Counts and Totals come from the rest of the help example)
table = spss.BasePivotTable("Group Means",
                            "OMS table subtype")
table.Append(spss.Dimension.Place.row,
             spss.GetVariableLabel(groupIndex))
table.Append(spss.Dimension.Place.column,
             spss.GetVariableLabel(sumIndex))

# One "Mean" column, one row per group
category2 = spss.CellText.String("Mean")
for cat in sorted(Counts):
    category1 = spss.CellText.Number(cat)
    table[(category1,category2)] = \
           spss.CellText.Number(Totals[cat]/Counts[cat])

Source: SPSS online help.

Creating new variables with R

2013-03-22T12:11:00.002-07:00

# Three examples for doing the same computations

mydata$sum <- mydata$x1 + mydata$x2
mydata$mean <- (mydata$x1 + mydata$x2)/2

attach(mydata)
mydata$sum <- x1 + x2
mydata$mean <- (x1 + x2)/2
detach(mydata)

mydata <- transform( mydata,
sum = x1 + x2,
mean = (x1 + x2)/2
)

Source for the above code.

R help with read

2013-03-21T20:57:00.001-07:00

help

Copying value labels from an existing variable...

2013-03-19T14:59:00.000-07:00

Alternatively, you could copy the value labels from another variable which has the same labels.

APPLY DICTIONARY
/FROM *
/SOURCE VARIABLES = b9a
/TARGET VARIABLES = occupf
/FILEINFO
/VARINFO ALIGNMENT FORMATS LEVEL MISSING VALLABELS = REPLACE VARLABEL WIDTH .

Missing Values

2012-12-13T13:16:00.001-08:00

You may have used the command MISSING VALUES in a file to declare certain value(s) as missing, for example:

MISSING VALUES V1 (8,9).

Here the values 8 and 9 will be considered missing in the data and will not be included in computations.

But sometimes you want those values back. One way would be to close the file and reload it, but that is cumbersome. The short way to do that is:

MISSING VALUES V1 ().

The above command clears any previously declared missing values for V1, so they are treated as valid data again.

Non-parametric tests

2012-11-08T12:49:00.003-08:00

Useful reviews of Non-parametric tests

http://www.une.edu.au/WebStat/unit_materials/c6_common_statistical_tests/nonparametric_test.html

http://www.graphpad.com/support/faqid/1790/

http://www.healthknowledge.org.uk/public-health-textbook/research-methods/1b-statistical-methods/parametric-nonparametric-tests

http://changingminds.org/explanations/research/analysis/parametric_non-parametric.htm

Data Cleaning (draft entry)

2012-08-16T14:19:00.002-07:00

My next post will be about data cleaning. I am not an expert on this, but I know a few things. One simple way to do this is to compare data entered by 2 different people. The command in SPSS is called

UPDATE FILE

Here is an example from UCLA site:

update file = "D:\person1.sav"
/in = flag1
/file = "D:\person2.sav"
/by all.
exe.
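
The idea behind double entry checking can be sketched in Python. This is not what SPSS does internally; it is only an illustration of comparing two entries of the same questionnaires, with made-up respondent ids and fields:

```python
# Hypothetical records keyed by respondent id, typed in by two different people.
person1 = {101: {"age": 34, "sex": 1}, 102: {"age": 27, "sex": 2}, 103: {"age": 45, "sex": 1}}
person2 = {101: {"age": 34, "sex": 1}, 102: {"age": 72, "sex": 2}, 103: {"age": 45, "sex": 1}}

# Collect every (id, field) where the two entries disagree.
mismatches = []
for rid in sorted(person1.keys() & person2.keys()):
    for field in person1[rid]:
        if person1[rid][field] != person2[rid].get(field):
            mismatches.append((rid, field, person1[rid][field], person2[rid].get(field)))

print(mismatches)
```

Each mismatch then has to be checked against the paper questionnaire to see which entry is right.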

More valuable information in this pdf.

http://www.ats.ucla.edu/stat/sas/library/nesug99/ss123.pdf

Need to update this blog!!

Effect Size

2012-08-15T14:53:00.003-07:00

"Statistical significance only tells the researcher how likely it is that an observed finding could have occurred by chance. It does not say anything about magnitude of the effect observed. Effect size is a name given to a group of statistics that measure the magnitude of a treatment effect. In many cases, effect size is a better measure of research outcomes than the significance level. This is because with large samples, one can observe statistically significant group differences even when only a tiny effect is present. Unlike significance tests, effect size indices are independent of sample size." source: http://www.umdnj.edu/idsweb/shared/effect_size.htm

Effect size calculator

another calculator

another calculator
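
As a rough illustration of what such calculators do, here is a Python sketch of Cohen's d, one commonly used effect size index (the quote above does not name a specific one, so this choice is an assumption). The scores are made up:

```python
from statistics import mean, variance


def cohens_d(a, b):
    """Standardized mean difference using the pooled sample variance."""
    pooled_var = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / (len(a) + len(b) - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5


# Hypothetical scores for two small groups.
treated = [5, 6, 7, 8, 9]
control = [3, 4, 5, 6, 7]
d = cohens_d(treated, control)

print(d)
```

The result expresses the difference between the group means in standard deviation units, which is what makes it comparable across studies.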

Data/Software/information sources (free)

2012-08-14T12:44:00.003-07:00

This is a loose compilation of sources of meta data/journals/software etc. related to population and health, concerning international issues in general but in particular the USA and Pakistan. I think this can be potentially very useful for graduate students of these two countries.

Asian Barometer "The Asian Barometer (ABS) is an applied research program on public opinion on political values, democracy, and governance around the region. The regional network encompasses research teams from 13 East Asian political systems (Japan, Mongolia, South Korea, Taiwan, Hong Kong, China, the Philippines, Thailand, Vietnam, Cambodia, Singapore, Indonesia, and Malaysia), and 5 South Asian countries (India, Pakistan, Bangladesh, Sri Lanka, and Nepal)."

Databases/software (free) for social sciences and public health:
http://en.citizendium.org/wiki/Free_statistical_software#_note-idams

Current Population Survey (CPS) Datasets for download (free; SAS format only)

Department of Health and Human Services (HHS) Data Finder

The General Social Survey (GSS) contains a standard 'core' of demographic, behavioral, and attitudinal questions, plus topics of special interest. Many of the core questions have remained unchanged since 1972 to facilitate time-trend studies as well as replication of earlier findings. The GSS takes the pulse of America, and is a unique and valuable resource. It has tracked the opinions of Americans over the last four decades.

Download data (SPSS format) from here.

Univ of Michigan Database of data files
http://www.icpsr.umich.edu/icpsrweb/ICPSR/themes/index.jsp

Princeton university Dataset sources for Pakistan

The NICHD Data and Specimen Hub (DASH) is a centralized resource that allows researchers to share and access de-identified data from studies funded by NICHD. DASH also serves as a portal for requesting biospecimens from selected DASH studies.

The Data Online for Population, Health and Nutrition (DOLPHN) system is an online statistical data resource containing selected current and historical country-level demographic and health indicator data. The DOLPHN system is designed to provide users with quick and easy access to frequently used statistics and can be helpful as both a reference and analytical tool.

Stanford University Data sets (free) http://data.stanford.edu/

Interesting link for PhD students
http://www2.hud.ac.uk/research/gradcentre/links.php

UNICEF/WHO sanitation and water

Open Source Publishing

Jstor Data
http://dfr.jstor.org/

Google and Wiley Interscience

Google public data visualization
The Google Public Data Explorer makes large datasets easy to explore, visualize and communicate.

Harvard data related to public health
The purpose of this website is to provide public health professionals, researchers, policy makers and students with a comprehensive catalog of Maternal and Child Health (MCH) data sets, interactive tools and other resources.

CDC Wonder
Wide-ranging Online Data for Epidemiologic Research

CDC newborn feeding practices datasets
http://www.cdc.gov/ifps/data/index.htm

CDC datasets on breastfeeding practices:
http://www.cdc.gov/breastfeeding/data/index.htm

The Cochrane Library (great for public health publications)
http://www.thecochranelibrary.com/view/0/index.html?gclid=CPWu8trNy6ACFdNA6wodqn_u0A

JHUCCP research tool database
http://new.jhuccp.org/research/researchDB/

Pew Research Center Databases
You can download the data collected by Pew Research Center from here for their various national and international surveys (the religion project includes Pakistan).

PRB Data Finder

Research Gate
Professional network for scientists.

RAND data

A UH student analysis on different meta-data sources:
http://www2.hawaii.edu/~jacso/extra/

UN population data

US Census, international population statistics

World Bank Datasets

World Bank Data

World Values Survey

http://www.ipums.org/ Integrated Public Use Microdata Series from Minnesota University

National Bureau of Economic Research data from diff. sources related to American Demographics and Economics

To ADD (or SUM) in SPSS

2012-07-25T20:31:00.000-07:00

Well, in SPSS you can add a series of variables in two different ways. In the first case you simply add two variables, e.g., sons and daughters, to get the total number of children. In the second, you want to create an index based on a series of scores but want to ignore the missing values of respondents who missed out on any of the variables in the series (i.e., there is a MISSING value in 1 or more variables for them).

compute t_child=sons+daughters.

compute t_child=sum (sons, daughters).

"The difference between the two procedures above is that in the first procedure, the case on total would be missing if any one of the four variables had missing values on a case; in the second procedure, the total would be computed while ignoring missing values on the four variables." No cases will be dropped due to a missing value in any of the variables. "Essentially SPSS treats the missing value as ZERO."

In the SUM function the variables must be separated by commas, but if there are many variables you can use the keyword TO to provide a range (TO follows the order of the variables in the data file). For example, if you want to construct a happiness index based on 12 indicators/variables hap1 to hap12, you can use the following syntax:

compute happiness=sum (hap1 to hap12).

Source: Indiana University IT Services and others.

Another point to note is that "the SUM() function is evidently flexible enough to respect more complex statements like SUM(Var1+Var2, Var3-Var4, Var5*Var6)." Hence, do not use the addition symbol when you use SUM unless it is part of an argument in the list. Source: SPSSX Discussion group

While talking about the flexibility and greatness of SUM, there is another neat feature to take note of. In case you want to limit how many MISSING values are tolerated, you can provide a number to TELL the computer to compute the sum for a CASE/RESPONDENT only if at least X variables are answered. So,

COMPUTE V3 = SUM.2(V1, V2). EXECUTE .

"The .2 appended to the end of the SUM function in the above example can be any integer. Use it to indicate the minimum number of valid cases necessary to perform a given calculation." Source: Indiana University IT Services

Also remember listwise and pairwise deletion, two ways SPSS handles cases with missing values. According to a discussion group they are defined as:

Listwise - if the respondent has a missing value for any variable, the respondent is omitted from all your data analysis.

Pairwise - not as harsh as listwise, in that the respondent is dropped only from analyses involving variables that have missing values.

Also check the IBM site and Psychwiki for more on list and pairwise deletion.
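
The difference between "+", SUM and SUM.n described in this post can be sketched in Python. The helper names `plus` and `spss_sum` are made up for this illustration, and None stands in for a missing value:

```python
def plus(*values):
    """Like sons + daughters: the result is missing (None) if any value is missing."""
    return None if any(v is None for v in values) else sum(values)


def spss_sum(*values, min_valid=1):
    """Like SUM.n(...): add the valid values if at least min_valid of them are present."""
    valid = [v for v in values if v is not None]
    return sum(valid) if len(valid) >= min_valid else None


print(plus(2, None))                   # missing, because one value is missing
print(spss_sum(2, None))               # the missing value is skipped
print(spss_sum(2, None, min_valid=2))  # missing, because fewer than 2 values are valid
```

With `min_valid=2` the sketch behaves like SUM.2: a respondent who answered only one of the two variables gets a missing total instead of a partial sum.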

Factor Analysis in short (not my writing)

2012-07-24T19:44:00.003-07:00

What is Factor Analysis?*
"Factor analysis is a form of exploratory multivariate analysis that is used to either reduce the number of variables in a model or to detect relationships among variables. All variables involved in the factor analysis need to be interval and are assumed to be normally distributed."

SPSS syntax:

factor
/variables read write math science socst
/criteria factors(2)
/extraction pc
/rotation varimax
/plot eigen.

Here is the syntax in SPSS from ANU course notes:

FACTOR
/VARIABLES q34_1 to q34_12
/MISSING LISTWISE /ANALYSIS q34_1 to q34_12
/PRINT INITIAL KMO REPR EXTRACTION ROTATION
/CRITERIA MINEIGEN(1) ITERATE(25)
/FORMAT SORT
/EXTRACTION PAF
/CRITERIA ITERATE(25)
/ROTATION VARIMAX
/METHOD=CORRELATION .

Create SCALE using FA

FACTOR
/VARIABLES q34_1 to q34_12
/MISSING LISTWISE /ANALYSIS q34_1 to q34_12
/PRINT INITIAL KMO REPR EXTRACTION ROTATION
/CRITERIA MINEIGEN(1) ITERATE(25)
/FORMAT SORT
/EXTRACTION PAF
/PLOT EIGEN
/CRITERIA ITERATE(25)
/ROTATION VARIMAX
/SAVE REG (2)
/METHOD=CORRELATION .

*Introduction to SAS. UCLA: Academic Technology Services, Statistical Consulting Group. from http://www.ats.ucla.edu/stat/sas/notes2/ (accessed November 24, 2007).

2012-07-24T19:41:00.000-07:00

Measuring unmet need for family planning.

Preamble
“Millions of women would prefer to avoid becoming pregnant either right away or ever, but are not using contraception. These women have an unmet need for family planning. Programs can serve many of these women by developing strategies that respond directly to their concern.” Ref: Population Reports, Sept 1996.

Unmet need is defined on the basis of women’s responses to survey questions, and the following are some of the definitions that have been used since the 1970s.

The KAP-Gap
Definition one: Women who wanted to have no more children but were not using contraception. (Ignored spacers, exposure to risk of pregnancy)

The world fertility survey (WFS 1972-1984)
Definition two: Same as above but excluded pregnant and amenorrheic women, because they did not currently need contraception. (Ignored spacers)

In 1981, John Anderson and Leo Morris measured the percentage of women of reproductive age who are “exposed to the risk of unintended pregnancy and are not using contraceptive” (included spacers). The next year, Nortman and Gary developed a model that included pregnant, breastfeeding, or amenorrheic women in the definition of unmet need.

After ICPD 1994, Sinding and Fathalla suggested measuring unmet need more broadly, including unmet need among people who are using contraception but may be dissatisfied with their method. By using both qualitative and quantitative data, they suggest, experience with side effects, discontinuation and other problems of contraception could help extend the focus of unmet need from use of any method to the quality of care.

There are arguments over who is at risk and whether inappropriate method use and method failure should be included. DHS started asking questions on intentions about the current pregnancy, thereby including pregnant women. A recently included category is unmarried women. In short, the aim is to include all women who are “at risk” of an unintended or mistimed pregnancy.

Considering the importance of measuring unmet need, all DHS and FP/RH survey questionnaires now ask about the extended definition of unmet need.

Casterline (1997) pointed out that there can be inaccuracies in the reporting of contraceptive use and in the reporting of fertility preferences, and both pieces of information are required for estimating unmet need. Furthermore, his work shows that unmet need is subject to different definitions, and its measurement is not straightforward. Therefore, any survey undertaken for the measurement of unmet need must consider issues of definition in advance.

Following chart shows the standard formulation of unmet need.

Naming multiple variables at the same time, with syntax

2012-07-24T19:36:00.002-07:00

Of course, it has to be with SYNTAX... I like to do everything with Syntax because of so many reasons but mostly to keep a log of what I am doing and secondly to reduce the key strokes I have to make for repeated jobs!!!

So, in case you have some variables for which you have to assign VALUE AND VARIABLE LABELS, wouldn't it be handy if you were able to do them with one command? I know it is a small thing and most people who use SPSS would laugh at me for even writing a blog entry on this, but believe me, it is easy to forget little things, especially if you go out of touch for a year or two. So, here is the command:

VAR LAB
q29a '(RHC)Number of hours facility open for consultation'
/q29b 'Number of hours facility open for consultation BHU'
/q29c 'Number of hours facility open for consultation MCH center'
/q29d 'Number of hours facility open for consultation Dispensary'
/q29e 'Number of hours facility open for consultation govt hospital'
/q29f 'Number of hours facility open for consultation Pvt hospital'
/q29g 'Number of hours facility open for consultation Dispensary/Compoder'
/q29h 'Number of hours facility open for consultation Nurse/LHV'
/q29i 'Number of hours facility open for consultation Hakeem/ Homeopath'
/q29j 'Number of hours facility open for consultation FWC'
/q29k 'Number of hours facility open for consultation RHS-A'
/q29l '(Others)Number of hours facility open for consultation'.

Please note that in the above syntax the first variable name after VAR LAB is written as is, but each of the rest is preceded by a forward slash "/".

Data transfer (migration) from Access to SPSS

2012-07-24T19:31:00.000-07:00

Problem:

Your data has been entered in MS Access, where all the variables (fields) have been defined names, width, type etc. and there are look up arrays/tables linked with each of the fields to describe the Response Values. But then you need to run some stats in SPSS. So, you basically EXPORT the file to some data analysis software like SPSS. One way to do it is export to MS Excel format and open/Import the Excel sheet into SPSS, which is pretty straightforward and simple. But then you examine the file and you will notice that in this transition, all the nifty labels of fields and values are gone and you will have to either make guesses or look at your data collection instruments to make sense of the numbers.

So, the choice you have is either to keep doing that or manually assign all the labels in SPSS. It is fine if you have only a handful of variables, but if you have a long list it is a lot of work!!!

What do you do? I have been trying to get around this problem for months now with no success. There is a Script on my favorite SPSS site:
http://www.spsstools.net/Scripts/ImportExport/ExportLabelsFromAccessToSPSS.txt

which should do the needful but I am certainly not doing it right. Need help. All the google search and various discussion groups have proven to be of no use also. Apparently I can create a link through ODBC but I am too lame to figure that out...

Any help?

Update 1: No luck with VB or Python or ODBC etc. because I am too dumb to learn them on my own! However, I learnt that if you have to do this a lot, there is a handy program that can do it for you. It is called Stat/Transfer. One caveat is that it is not FREE. The student version costs $59. In future I would buy it if I am stuck with multiple transfers between various databases. In the past there was also DBMS copy for such things, but it does not exist anymore. I have tried to search for an open-source alternative to Stat/Transfer but no luck yet!!

Entering quantitative data into the computer

2012-01-26T00:29:00.000-08:00

Well, there are many, many ways you can enter quantitative data (generated from surveys) into a computer. Most of the people I know use either a simple Excel table or a very, very complex MS Access data entry file. However, neither of these programs was created for data entry, and hence they are not so handy for that purpose.

There are several software packages on the market which you can buy to do the job, or you can outsource the data entry, or create a program for the very purpose yourself. All of these options require extra resources. SPSS, the most popular data analysis program, also has a data entry module, which is great, but it is not free.

The best option would be a data entry program which is made just for that purpose, is easy to learn and use and, most importantly, is free. Thankfully there are 2 such programs:

1) CSPro
"The Census and Survey Processing System (CSPro) is a public domain software package used by hundreds of organizations and tens of thousands of individuals for entering, editing, tabulating, and disseminating census and survey data."

2) EpiData
"EpiData Entry is used for simple or programmed data entry and data documentation. Entry handles simple forms or related systems Optimised documentation."

In my view Excel and Access have their strengths, but for any survey with a questionnaire more than a page long and more than 40-50 respondents, it is better to use one or the other of the above.

Range checks, value lengths, logical jumps, automatically calculated fields, data validation on double entry, finding duplicates, assigning values for missing or not applicable values, etc. are some of the things that need to be done to maintain the integrity of data. Both of the above programs can do that.

The files that are created in these programs can easily be exported to major analytical packages like SPSS and STATA.

Converting a table to Data File

2012-01-12T20:28:00.000-08:00

I have been looking for this for so long.....

Thank you Mr. John Walkenbach

Here is a nifty Macro in Excel to do that!!!

Sub ReversePivotTable()
'   Before running this, make sure you have a summary table with column headers.
'   The output table will have three columns.
    Dim SummaryTable As Range, OutputRange As Range
    Dim OutRow As Long
    Dim r As Long, c As Long

    On Error Resume Next
    Set SummaryTable = ActiveCell.CurrentRegion
    If SummaryTable.Count = 1 Or SummaryTable.Rows.Count < 3 Then
        MsgBox "Select a cell within the summary table.", vbCritical
        Exit Sub
    End If
    SummaryTable.Select
    Set OutputRange = Application.InputBox(prompt:="Select a cell for the 3-column output", Type:=8)
'   Convert the range
    OutRow = 2
    Application.ScreenUpdating = False
'   Header row of the output table
    OutputRange.Range("A1:C1") = Array("Column1", "Column2", "Column3")
    For r = 2 To SummaryTable.Rows.Count
        For c = 2 To SummaryTable.Columns.Count
            OutputRange.Cells(OutRow, 1) = SummaryTable.Cells(r, 1)
            OutputRange.Cells(OutRow, 2) = SummaryTable.Cells(1, c)
            OutputRange.Cells(OutRow, 3) = SummaryTable.Cells(r, c)
            OutputRange.Cells(OutRow, 3).NumberFormat = SummaryTable.Cells(r, c).NumberFormat
            OutRow = OutRow + 1
        Next c
    Next r
End Sub
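
What the macro does can be sketched in a few lines of Python: every data cell of the summary table becomes one row of (row label, column header, value). The table below is made up for illustration:

```python
# Hypothetical summary table: first row holds column headers, first column holds row labels.
summary = [
    ["Region", "Jan", "Feb"],
    ["North", 10, 12],
    ["South", 7, 9],
]

# One output row per data cell: (row label, column header, value).
rows = []
for r in range(1, len(summary)):
    for c in range(1, len(summary[0])):
        rows.append((summary[r][0], summary[0][c], summary[r][c]))

for row in rows:
    print(row)
```

The two nested loops mirror the For r / For c loops of the macro, which also start at the second row and second column to skip the labels.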

VlookUp in Excel

2012-01-12T13:43:00.000-08:00

Learnt a lot about VlookUp today. It is really one of the most powerful features of MS Excel.

Few things to remember.
The formula will return a "0" (without quotes) if the index variable was matched but there was no corresponding value to return. If the index variable does not match, Excel will return #N/A.

-Remember the column number you want to retrieve. This column number is counted from the column where you "looked" for the index variable.

-You can look up in another file.
-Linked files get updated even if they are not open.
-To keep track of what changes have been made, it is a good idea to use the "TRACK CHANGES" feature.
-Excel cares about trailing spaces but does not care about differences in CASE or the type of the cell.
-Cells cannot be protected, hence you have to be very careful when you use vlookup.

Example of VLookup with data in 2 different sheets.

=VLOOKUP(A:A,Sheet3!A:B,2,FALSE)

Another one,
=VLOOKUP(O2,B2:K41,5)

In this example O2 (the first argument) is the address of the value that you want to MATCH (the index), for example a registration number, SSN, etc., against the source data. You can also type a value here, but giving a cell address makes it easier to copy the formula down a range. The second argument, B2:K41, is the range of source data, whose first column must contain your INDEX variable. The third argument, 5, is the number of the column to return, counted from the index column as 1. For example, if your INDEX is in column B and your VALUE (i.e., name) is in column E, you should write 4 here. Because the fourth argument is omitted in this example, Excel does an approximate match, which assumes the first column is sorted; add FALSE, as in the first example, for an exact match.

Make sure to select the complete range of columns for the lookup. For example, if you have school names in column A, the frequency of a particular symptom in column B, and you want to get the total number of health room visits from column E (while D has the corresponding school name), you need to select both columns D and E for the second argument (not just one column), then put 2 for the column number, since E is the second column of that range.
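
The way the column number is counted can be sketched in Python. The function below is a made-up stand-in for an exact-match VLOOKUP (it is not how Excel implements it, and unlike Excel it is case sensitive); the school data are invented:

```python
def vlookup(value, table, col_index):
    """Exact-match lookup: search the first column, return the col_index-th column (counted from 1)."""
    for row in table:
        if row[0] == value:
            return row[col_index - 1]
    return "#N/A"  # stand-in for Excel's error value


# Hypothetical range D:E, school name in the first column, visits in the second.
visits = [("School A", 40), ("School B", 25)]

print(vlookup("School B", visits, 2))  # the visits column
print(vlookup("School B", visits, 1))  # column 1 is the lookup column itself
print(vlookup("School X", visits, 2))  # no match
```

Column number 1 only hands back the value you searched for, so the value you actually want is always at 2 or higher.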

Comparing/Updating multiple files against a Master file

2012-01-12T13:41:00.000-08:00

UPDATE FILE command can be used for that. More text to add later.

Meanwhile check this page.
http://www.ats.ucla.edu/stat/spss/faq/update.htm

Excel oddities

2012-01-10T13:13:00.000-08:00

Check this page for some oddities and quirks related to Excel.

http://spreadsheetpage.com/index.php/oddities